[ 
https://issues.apache.org/jira/browse/FLINK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969932#comment-16969932
 ] 

Zhu Zhu commented on FLINK-14606:
---------------------------------

Thanks [~chesnay] for sharing the thoughts. 
My bad that in the description I talked too much on why this change would work, 
but not enough for the motivation to do it. 
The motivation is actually to fix some potential issues. If the issues are not 
valid or there are other better ways to solve it, it's fine ignore the initial 
proposal.

Potential issues:
1. tasks may not be canceled if a task deployment succeeded at TM, while the 
success response to JM is lost. (isCallback==true in this case, see the 
[execution deploy 
logic|https://github.com/apache/flink/blob/b0a9afdd24fb70131b1e80d46d0ca101235a4a36/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L739])
2. From entry {{Execution#markFailed(Throwable)}}, the task is thought to not 
be deployed, thus isCallback==true and  {{sendCancelRpcCall}} will not be 
invoked. But we will still releasing partitions since releasePartitions==true, 
even though the task is not deployed and the partition then should not have 
been created. Would it matter if we set releasePartitions to be false in this 
case to reduce RPC calls?

For the {{fromSchedulerNg}} flag, agreed that we can keep it as is since the 
benefit to unify it is not obvious.




> Simplify params of Execution#processFail
> ----------------------------------------
>
>                 Key: FLINK-14606
>                 URL: https://issues.apache.org/jira/browse/FLINK-14606
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> The 3 params fromSchedulerNg/releasePartitions/isCallback of 
> Execution#processFail are quite a mess while they seem to be correlated. 
> I'd propose to simplify the prams of processFail by using a 
> {{isInternalError}} to replace those 3 params. {{isInternalError}} is true 
> iff the failure is from TM(strictly speaking, notified from SchedulerBase). 
> This also hardens the handling of cases that a task is successfully deployed 
> but JM does not realize it(see #3 below).
> Here's why these 3 params can be simplified:
> 1. {{fromSchedulerNg}}, true iff the failure is from TM and 
> isLegacyScheduling==false.
>     It's only used like this: {{if (!fromSchedulerNg && 
> !isLegacyScheduling()))}}. So it's the same to use {{!isInternalFailure}} to 
> replace it.
> 2. {{releasePartitions}}, true iff the failure is from TM.
>   Now the value is exactly the same as {{isInternalFailure}}, we can drop it 
> and use {{isInternalFailure}} instead.
> 3. {{isCallback}}, true iff the failure is from TM or the task is not 
> deployed.
>     It's only used like this: {{(!isCallback && (current == RUNNING || 
> current == DEPLOYING))}}.
>     So using {{!isInternalFailure}} to replace it would be enough. It is a 
> bit different for the case that a task deployment to a task manager fails, 
> which set {{isCallback}} to true previously. However, it would be safer to 
> signal a cancel call, in case the deployment is actually a success but the 
> response is lost on network.
> cc [~GJL]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to