[jira] [Commented] (FLINK-10038) Parallel the creation of InputSplit if necessary

JIRA Thu, 30 Aug 2018 02:07:20 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597215#comment-16597215
 ]


陈梓立 commented on FLINK-10038:
-----------------------------

What I originally meant by "parallelize the creation of InputSplit" was 
parallelizing the work inside ONE {{createInputSplits}} call. Take a look at 
{{FileInputFormat#createInputSplits}}: it creates InputSplits file by file. 
That is the part I aim to parallelize, which is why I said "the interface for 
the creation of input splits is definitely 
InputSplitSource#createInputSplits". This can be done without modifying the 
interface, by changing the implementation of {{createInputSplits}}.
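
A minimal sketch of this idea, assuming the per-file work is independent and can be handed to a thread pool. It does not use Flink's classes: {{SplitSketch}}, {{ParallelSplitCreation}}, and the file-size lists are hypothetical stand-ins, and the real {{FileInputFormat}} does more per file (file system lookups, block locations) than the arithmetic shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a file split; not Flink's FileInputSplit.
final class SplitSketch {
    final String file;
    final long start;
    final long length;

    SplitSketch(String file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }
}

final class ParallelSplitCreation {
    // Cuts one file of the given size into chunks of at most splitSize bytes.
    static List<SplitSketch> splitsForFile(String file, long fileSize, long splitSize) {
        List<SplitSketch> result = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            result.add(new SplitSketch(file, start, Math.min(splitSize, fileSize - start)));
        }
        return result;
    }

    // Same result as a file-by-file loop, but each file is handled by a pool thread.
    // files and sizes are parallel lists (assumption made for brevity).
    static List<SplitSketch> createInputSplits(
            List<String> files, List<Long> sizes, long splitSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<SplitSketch>>> tasks = new ArrayList<>();
            for (int i = 0; i < files.size(); i++) {
                final String file = files.get(i);
                final long size = sizes.get(i);
                tasks.add(() -> splitsForFile(file, size, splitSize));
            }
            List<SplitSketch> all = new ArrayList<>();
            // invokeAll returns futures in task order, so the split order stays deterministic.
            for (Future<List<SplitSketch>> future : pool.invokeAll(tasks)) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while creating splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Split creation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<SplitSketch> splits = createInputSplits(
                Arrays.asList("a", "b"), Arrays.asList(250L, 100L), 100L, 2);
        System.out.println(splits.size() + " splits created");
    }
}
```

The caller still sees a single method that returns all splits, which is the point of keeping the interface unchanged.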

However, your ideas here are also bright. A typical case that would benefit 
from them is a batch job with many files, where we would prefer to use the 
RegionFailover strategy if possible.
Here I see 3 options. 1. Create InputSplits before the job runs. 2. Create 
InputSplits concurrently with scheduling the job. 3. Use a specific single task 
to generate the work.
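
As a rough illustration of option 2 only, the sketch below starts the split computation asynchronously and lets scheduling proceed in the meantime. {{computeSplits}} and {{scheduleTasks}} are hypothetical placeholders, not Flink methods, and the sketch ignores failover entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ConcurrentSplitSketch {
    // Hypothetical placeholder for the (possibly slow) split computation.
    static List<String> computeSplits() {
        return Arrays.asList("split-0", "split-1");
    }

    // Hypothetical placeholder for scheduling the job's tasks.
    static String scheduleTasks() {
        return "scheduled";
    }

    static String run() {
        // Start computing splits on a background thread.
        CompletableFuture<List<String>> splits =
                CompletableFuture.supplyAsync(ConcurrentSplitSketch::computeSplits);
        // Scheduling runs on the calling thread while the splits are being computed.
        String state = scheduleTasks();
        // Combine both results once the splits are available.
        return splits.thenApply(s -> state + " with " + s.size() + " splits").join();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```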

Option 1 is easier to implement, as [~StephanEwen] said. Below are the concrete 
challenges for the remaining options.

My main concern is that in a batch job we prefer not to cancel all vertices 
and restart. What's worse, since we don't have batch checkpoints, the batch job 
has to restart completely. This is unacceptable for large-scale batch jobs.
For option 2, what if the JM fails over after some input splits have been 
computed and sent off? We don't have a specific JM failover strategy now, so 
this causes the job to restart completely. Pursuing this option leads to 
discussing a JM failover strategy, that is, when the JM fails over and 
restarts, it can recover (reconcile) state from the previous one.
For option 3, there would be a wider consideration about Source. Consider the 
two-input case below. Currently we read from a source in a blocking fashion. 
If we compute the input splits in a single task and still use the blocking 
approach, the downstream may get stuck waiting for one input while the other 
input is ready to be read.

Src1 ----\
Src2 ----> Join

One way to solve this issue is to read from the source in a non-blocking 
fashion. Assume we introduce a method 
{{boolean SourceFunction#next(Collector<T>)}}: when the downstream calls it, 
the source sends its data to the collector and returns true. If no more data 
remains, it returns false. This also allows the source to read from files and 
produce data asynchronously.
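
A minimal sketch of such a pull-style contract, under stated assumptions: {{PullSource}}, {{QueueSource}}, and {{TwoInputSketch}} are hypothetical names, {{java.util.function.Consumer}} stands in for {{Collector<T>}}, and the source emits one record per call. The two-input consumer below only merges records; a real join would also buffer and match them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical pull-style source. This is NOT an existing Flink interface;
// Consumer<T> stands in for Collector<T>.
interface PullSource<T> {
    // Hands available data to the collector and returns true,
    // or returns false once no more data remains.
    boolean next(Consumer<T> collector);
}

// Toy source backed by an in-memory queue, emitting one record per call.
final class QueueSource<T> implements PullSource<T> {
    private final Queue<T> records;

    QueueSource(List<T> records) {
        this.records = new ArrayDeque<>(records);
    }

    @Override
    public boolean next(Consumer<T> collector) {
        T record = records.poll();
        if (record == null) {
            return false; // exhausted
        }
        collector.accept(record);
        return true;
    }
}

final class TwoInputSketch {
    // Alternates between the inputs instead of draining one before touching the other.
    static <T> List<T> drain(PullSource<T> left, PullSource<T> right) {
        List<T> out = new ArrayList<>();
        boolean leftOpen = true;
        boolean rightOpen = true;
        while (leftOpen || rightOpen) {
            if (leftOpen) {
                leftOpen = left.next(out::add);
            }
            if (rightOpen) {
                rightOpen = right.next(out::add);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> merged = drain(
                new QueueSource<>(Arrays.asList(1, 2, 3)),
                new QueueSource<>(Arrays.asList(10)));
        System.out.println(merged);
    }
}
```

Because the downstream decides when to call {{next}}, it can keep consuming from whichever input still has data, which is the property the two-input case needs.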

To sum up, focusing on batch jobs, the main issues are JM failover for 
options 1 and 2 (plus batch checkpointing, which is outside this issue but 
significant), and a more flexible source for option 3.

> Parallel the creation of InputSplit if necessary
> ------------------------------------------------
>
>                 Key: FLINK-10038
>                 URL: https://issues.apache.org/jira/browse/FLINK-10038
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: 陈梓立
>            Priority: Major
>              Labels: improvement, inputformat, parallel, perfomance
>
> As a continuation of the discussion in the PR about parallelizing the 
> creation of ExecutionJobVertex [here|https://github.com/apache/flink/pull/6353].
> [~StephanEwen] suggested that we could parallelize the creation of 
> InputSplit, from which we gain performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
