[jira] [Commented] (FLINK-19150) Behaviour change after migration from 1.9 to 1.11

Aljoscha Krettek (Jira) Mon, 07 Sep 2020 01:52:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191581#comment-17191581
 ]


Aljoscha Krettek commented on FLINK-19150:
------------------------------------------

I'm afraid this is intentional. Let me try and understand your issue better:
 - You're using YARN
 - In Flink 1.9 you were using "attached per-job mode"
 - The original client submits the job and waits for it to complete
 - In between, you start another client to also request the job result
 - The original client will shut down the cluster when the job finishes

Is that correct?

Now, to explain the changed behaviour. In Flink 1.9 "per-job attached" mode was 
a pseudo per-job mode: Behind the scenes, Flink would start a session cluster, 
submit the job to that cluster, and then shut down the cluster from the client 
once the job finished. We thought this could lead to problems if the client 
crashes or if the connection to the cluster breaks. In those cases nobody would 
shut down the cluster once the job finishes.

We therefore turned "per-job attached" mode into a real per-job mode in Flink 
1.10: The cluster is started with the job, and once the job has finished (and 
someone has retrieved the result) the cluster shuts down. I don't like that 
we're waiting for result retrieval here, but see my comment on another Jira 
issue for that: 
https://issues.apache.org/jira/browse/FLINK-18959?focusedCommentId=17186453&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17186453
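
To make the difference concrete, here is a toy model of the two shutdown 
policies described above. It is only a sketch and not Flink code: the names 
ToyCluster, request_job_result and run_scenario are made up for illustration, 
and the scenario assumes that a second client asks for the job result before 
the original attached client does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster; this is not a Flink API."""

    # True models the "real" per-job mode (1.10 and later): the cluster goes
    # away once the job has finished and someone has retrieved the result.
    shuts_down_on_result_retrieval: bool
    job_finished: bool = False
    running: bool = True

    def finish_job(self) -> None:
        self.job_finished = True

    def request_job_result(self, client: str) -> str:
        if not self.running:
            raise ConnectionError(f"{client}: cluster is already gone")
        if not self.job_finished:
            raise RuntimeError("job is still running")
        if self.shuts_down_on_result_retrieval:
            # Any client retrieving the result triggers the shutdown.
            self.running = False
        return "FINISHED"

    def shutdown(self) -> None:
        self.running = False


def run_scenario(real_per_job: bool) -> List[str]:
    """A second client fetches the result before the original client."""
    cluster = ToyCluster(shuts_down_on_result_retrieval=real_per_job)
    cluster.finish_job()
    events = []
    for client in ("second-client", "original-client"):
        try:
            result = cluster.request_job_result(client)
            events.append(f"{client}: {result}")
        except ConnectionError as err:
            events.append(str(err))
    if not real_per_job and cluster.running:
        # Pseudo per-job mode (1.9): only the original client tears down
        # the session cluster, after it has seen the job finish.
        cluster.shutdown()
        events.append("original-client: shut down cluster")
    return events


if __name__ == "__main__":
    print(run_scenario(real_per_job=False))
    print(run_scenario(real_per_job=True))
```

In the pseudo per-job model both clients get the result and the original 
client shuts the cluster down afterwards. In the real per-job model the first 
retrieval ends the cluster, so the original client sees a lost connection, 
which matches the symptom reported in this issue.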

Why do you need to retrieve the job result from another client as well? Maybe 
we can find a better solution for this.

> Behaviour change after migration from 1.9 to 1.11
> -------------------------------------------------
>
>                 Key: FLINK-19150
>                 URL: https://issues.apache.org/jira/browse/FLINK-19150
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.11.0
>            Reporter: Jiayi Liao
>            Priority: Major
>
> In Flink 1.9, if we submit a job in attached mode, the client shuts down the 
> cluster at the end. In Flink 1.11, however, the cluster is shut down as soon 
> as the job result is requested, no matter where the request comes from.
> Currently we've found that the client fails to get the job execution result 
> and reports a connection-lost error because the result has already been 
> requested and the cluster has shut down. The root cause is under 
> investigation, but we think it might be related to the network environment 
> since the frequency of occurrence is very low.
> I'm aware that job submission was substantially restructured between Flink 
> 1.9 and 1.11. But is this change expected?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
