[ https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653060#comment-16653060 ]
ASF GitHub Bot commented on FLINK-10205:
----------------------------------------

tillrohrmann commented on issue #6684: [FLINK-10205] Batch Job: InputSplit Fault tolerant for DataSource…
URL: https://github.com/apache/flink/pull/6684#issuecomment-430508466

Before moving forward I'd like to understand why it is strictly necessary that the failed tasks reprocess the same set of input splits. Is it because streaming sources can have state which they would use to filter out already processed splits? In the batch case this should not be a problem, because it should not matter which task processes which input split. If a failure occurs and some other task takes over the failed input splits, it would be as if that task had processed these input splits from the very beginning.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> -------------------------------------------------------
>
>                 Key: FLINK-10205
>                 URL: https://issues.apache.org/jira/browse/FLINK-10205
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager
>    Affects Versions: 1.6.1, 1.7.0, 1.6.2
>            Reporter: JIN SUN
>            Assignee: JIN SUN
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSourceTask pulls InputSplits from the JobManager to achieve better performance. However, when a DataSourceTask fails and is rerun, it will not get the same splits as its previous attempt. This can introduce inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), these two executions should process the same splits.
> We need to fix this issue to make the inputs of a DataSourceTask deterministic. The proposal is to save all splits in the ExecutionVertex so that the DataSourceTask pulls its splits from there.
> Design document: [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
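As a rough illustration of the proposal quoted above, the sketch below shows a JobManager-side assigner that records which splits each ExecutionVertex received, so that a restarted attempt of the same vertex replays exactly the same splits instead of pulling fresh ones. All names here (DeterministicSplitAssigner, nextSplit, replaySplits) are hypothetical and are not Flink's actual API; see the linked design document for the real proposal.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink's actual classes: record the splits handed to
// each ExecutionVertex so a re-executed attempt reprocesses the same inputs.
class DeterministicSplitAssigner {
    private final Deque<String> pending;                                  // splits not yet handed out
    private final Map<Integer, List<String>> history = new HashMap<>();   // vertexId -> splits it received

    DeterministicSplitAssigner(Collection<String> splits) {
        this.pending = new ArrayDeque<>(splits);
    }

    /** First attempt: pull the next unassigned split and remember who got it. Returns null when exhausted. */
    String nextSplit(int vertexId) {
        String split = pending.poll();
        if (split != null) {
            history.computeIfAbsent(vertexId, k -> new ArrayList<>()).add(split);
        }
        return split;
    }

    /** Restarted attempt: replay exactly the splits this vertex saw before, in order. */
    List<String> replaySplits(int vertexId) {
        return history.getOrDefault(vertexId, List.of());
    }
}
```

With this bookkeeping, a rerun of vertex 0 calls replaySplits(0) and gets back its original split sequence, which makes the input of a DataSourceTask deterministic across attempts; a duplicate (speculative) execution of the same vertex can use the same call to process identical splits.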