[
https://issues.apache.org/jira/browse/FLINK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777549#comment-16777549
]
zhijiang commented on FLINK-6227:
---------------------------------
The initial motivation of this task is to give a hint/indicator for
{{FailoverStrategy}} in {{JobMaster}} especially in batch job.
For example, in pipelined job, once the downstream task failed, we always need
to restart the corresponding pipelined upstream task to re-produce the data,
because the produced data is not persistent.
But for batch job, the upstream task might produce the data in persistent local
disk. And the downstream task is scheduled to consume the data until the
upstream task finishes. If the downstream task fails during consuming the
partition data, we might have two options to decide whether to restart the
upstream task to re-produce the data.
If the upstream produced data is still available in local disk, it is no need
to restart the task to re-produce the data. In another case if the produced
data is corrupted in disk or not available for other reasons, we need to
restart the upstream task. For this case, we introduce the
{{DataConsumptionException}} to indicate whether restarting the upstream task
or not for {{JobMaster}}. In current implementation, the sub partition view
would read the persistent data from local disk, and wrap certain exceptions
such as {{IOException}} into {{DataConsumptionException}} during reading files.
And this exception would be transferred via network to downstream side. The
downstream side would fail and then notify the JobMaster of this exception.
But I think this task might out-of-date considering recent other features. We
propose pluggable {{ShuffleManager}} feature recently, and it would bring
{{ShuffleMaster}} component in {{JobMaster}} to manage global partitions. Then
it might be feasible to provide some features for giving some hints of
partition available to FailoverStrategy to make the decision. You could
concentrate on this feature [1] if interested.
[1] https://issues.apache.org/jira/browse/FLINK-10653
> Introduce the DataConsumptionException for downstream task failure
> ------------------------------------------------------------------
>
> Key: FLINK-6227
> URL: https://issues.apache.org/jira/browse/FLINK-6227
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Reporter: zhijiang
> Assignee: Aleksandr Salatich
> Priority: Minor
>
> It is part of FLIP-1.
> We define a new special exception to indicate the downstream task failure in
> consuming upstream data.
> The {{JobManager}} will receive and consider this special exception to
> calculate the minimum connected sub-graph which is called {{FailoverRegion}}.
> So the {{DataConsumptionException}} should contain {{ResultPartitionID}}
> information for {{JobMaster}} tracking upstream executions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)