## What is the purpose of the change

    Today, a DataSourceTask pulls InputSplits from the JobManager to achieve
    better performance. However, when a DataSourceTask fails and is rerun, it
    does not get the same splits as its previous attempt. This can introduce
    inconsistent results or even data corruption.

    Furthermore, if two executions run at the same time (in a batch
    scenario), both executions should process the same splits.

    We need to fix this issue to make the inputs of a DataSourceTask
    deterministic. The proposal is to store all splits in the ExecutionVertex,
    and have the DataSourceTask pull its splits from there.


## Brief change log

  - *JobMaster delegates getNextInputSplit to the Execution*
  - *Execution forwards getNextInputSplit, together with the sequence number of the request, to the ExecutionVertex*
  - *If the sequence number already exists in the ExecutionVertex, the cached split is returned; otherwise a new split is calculated and cached*
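
The caching behavior described above can be illustrated with a minimal sketch. The class and method names below (`CachedSplitProvider`, the `String` splits) are illustrative stand-ins, not Flink's actual `InputSplit`/`ExecutionVertex` API; the point is only that replaying a request with a previously seen sequence number must yield the same split.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a per-vertex split cache keyed by request
// sequence number. A failed-and-rerun attempt that replays sequence
// numbers 0, 1, 2, ... receives exactly the same splits as before.
class CachedSplitProvider {
    private final List<String> cachedSplits = new ArrayList<>();
    private int nextIndex = 0;

    synchronized String getNextInputSplit(int sequenceNumber) {
        if (sequenceNumber < cachedSplits.size()) {
            // Sequence number already seen: return the cached split.
            return cachedSplits.get(sequenceNumber);
        }
        // First time this sequence number is requested: calculate
        // (here: a stand-in for pulling from the real split assigner)
        // and cache the result before returning it.
        String split = "split-" + nextIndex++;
        cachedSplits.add(split);
        return split;
    }
}
```

A rerun attempt asking for sequence number 0 again gets the identical split, which is what makes the DataSourceTask's input deterministic across attempts.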

## Verifying this change

This change added tests and can be verified as follows:
  - *Covered by existing tests*
  - *Added a new test that validates calling getNextInputSplit multiple times with different Execution attempts per ExecutionVertex*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / **no** / 
don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)


[ Full content available at: https://github.com/apache/flink/pull/6684 ]