[jira] [Commented] (FLINK-6227) Introduce the DataConsumptionException for downstream task failure

zhijiang (JIRA) Mon, 25 Feb 2019 19:52:18 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777549#comment-16777549
 ]


zhijiang commented on FLINK-6227:
---------------------------------

The initial motivation for this task was to give the {{FailoverStrategy}} in 
the {{JobMaster}} a hint, especially for batch jobs.

For example, in a pipelined job, once a downstream task fails, we always need 
to restart the corresponding pipelined upstream task to reproduce the data, 
because the produced data is not persisted.

But in a batch job, the upstream task might persist the data it produces to 
local disk, and the downstream task is scheduled to consume the data only 
after the upstream task finishes. If the downstream task fails while consuming 
the partition data, there are two cases to consider when deciding whether to 
restart the upstream task to reproduce the data.

If the data produced by the upstream task is still available on local disk, 
there is no need to restart the task to reproduce it. If instead the produced 
data is corrupted on disk or unavailable for other reasons, we need to 
restart the upstream task. For this case, we introduce the 
{{DataConsumptionException}} to indicate to the {{JobMaster}} whether the 
upstream task needs to be restarted. In the current implementation, the 
subpartition view reads the persisted data from local disk and wraps certain 
exceptions, such as {{IOException}}, into a {{DataConsumptionException}} while 
reading files. This exception is then transferred over the network to the 
downstream side. The downstream task fails and notifies the {{JobMaster}} of 
this exception.

But I think this task might be out of date considering other recent features. 
We recently proposed a pluggable {{ShuffleManager}} feature, which would bring 
a {{ShuffleMaster}} component into the {{JobMaster}} to manage partitions 
globally. It might then be feasible to give the {{FailoverStrategy}} hints 
about partition availability to help it make the decision. You could follow 
this feature [1] if interested.

[1] https://issues.apache.org/jira/browse/FLINK-10653

 

> Introduce the DataConsumptionException for downstream task failure
> ------------------------------------------------------------------
>
>                 Key: FLINK-6227
>                 URL: https://issues.apache.org/jira/browse/FLINK-6227
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Network
>            Reporter: zhijiang
>            Assignee: Aleksandr Salatich
>            Priority: Minor
>
> It is part of FLIP-1.
> We define a new special exception to indicate that a downstream task failed 
> while consuming upstream data. 
> The {{JobManager}} will receive and consider this special exception to 
> calculate the minimum connected sub-graph, which is called {{FailoverRegion}}. 
> So the {{DataConsumptionException}} should contain {{ResultPartitionID}} 
> information so that the {{JobMaster}} can track upstream executions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
