[
https://issues.apache.org/jira/browse/FLINK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597215#comment-16597215
]
陈梓立 commented on FLINK-10038:
-----------------------------
When I mentioned "parallelize the creation of InputSplit", my original intent
was to parallelize the work inside ONE {{createInputSplits}} call. Take a look
at {{FileInputFormat#createInputSplits}}: it creates InputSplits file by file.
That is the part I aim to parallelize. Hence my statement that "the interface
for the creation of input splits is definitely
InputSplitSource#createInputSplits". This can be done without modifying the
interface, by changing the implementation of {{createInputSplits}}.
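A minimal sketch of the per-file parallelization described above, under stated
assumptions. {{SketchSplit}}, {{ParallelSplitSketch}}, {{splitFile}} and the
{{paths}}/{{lengths}}/{{splitSize}}/{{threads}} parameters are hypothetical
stand-ins and not Flink classes or signatures; the real per-file work
(presumably file listing and block lookups) is reduced to simple arithmetic
here. The point is only that the method signature seen by the caller stays the
same while the per-file work runs on a thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a file split; not Flink's FileInputSplit.
final class SketchSplit {
    final int splitNumber;
    final String path;
    final long start;
    final long length;

    SketchSplit(int splitNumber, String path, long start, long length) {
        this.splitNumber = splitNumber;
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

final class ParallelSplitSketch {

    // Per-file work: cut one file into {start, length} ranges of at most splitSize.
    static List<long[]> splitFile(long fileLength, long splitSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            ranges.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return ranges;
    }

    // Same result as a sequential file-by-file loop, but each file is
    // processed as its own task on a fixed-size thread pool.
    static List<SketchSplit> createInputSplits(
            List<String> paths, List<Long> lengths, long splitSize, int threads) {
        if (splitSize <= 0 || threads <= 0 || paths.size() != lengths.size()) {
            throw new IllegalArgumentException("invalid arguments");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<long[]>>> pending = new ArrayList<>();
            for (int i = 0; i < lengths.size(); i++) {
                final long fileLength = lengths.get(i);
                Callable<List<long[]>> task = () -> splitFile(fileLength, splitSize);
                pending.add(pool.submit(task));
            }
            // Collect in submission order so split numbering stays deterministic.
            List<SketchSplit> result = new ArrayList<>();
            for (int i = 0; i < pending.size(); i++) {
                for (long[] range : pending.get(i).get()) {
                    result.add(new SketchSplit(result.size(), paths.get(i), range[0], range[1]));
                }
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while creating splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("split creation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the split numbers identical
to what a sequential loop would produce, so callers should not notice the
change.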
However, your ideas here are also bright. A typical case that would benefit
from them is a batch job with many files, where we would prefer to use the
RegionFailover strategy if possible.
Here I see 3 options: 1. Create InputSplits before the job runs. 2. Create
InputSplits concurrently with scheduling the job. 3. Use a specific single task
to generate the work.
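To make option 2 concrete, here is a rough sketch in which split creation and
scheduling run as two independent futures that are joined at the end. All
names ({{ConcurrentSplitSketch}}, {{computeSplits}}, {{scheduleJob}}, {{run}})
are hypothetical and only illustrate the overlap; this is not how the
JobManager is structured, and it ignores failover entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ConcurrentSplitSketch {

    // Hypothetical stand-in for the (potentially slow) split computation.
    static List<String> computeSplits(int count) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            splits.add("split-" + i);
        }
        return splits;
    }

    // Hypothetical stand-in for scheduling the job's tasks.
    static String scheduleJob(String jobName) {
        return jobName + ":scheduled";
    }

    // Start both pieces of work, then combine them once both have finished.
    static String run(String jobName, int splitCount) {
        CompletableFuture<List<String>> splits =
                CompletableFuture.supplyAsync(() -> computeSplits(splitCount));
        CompletableFuture<String> scheduled =
                CompletableFuture.supplyAsync(() -> scheduleJob(jobName));
        return scheduled
                .thenCombine(splits, (state, s) -> state + " with " + s.size() + " splits")
                .join();
    }
}
```

Neither future waits for the other, which is where the gain of option 2 over
option 1 would come from.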
Option 1 is the easiest to implement, as [~StephanEwen] said. Below are the
concrete challenges for the remaining options.
My main concern is that in a batch job we prefer not to cancel all vertices
and restart. What's worse, since we don't have batch checkpoints, the batch job
has to restart completely. This is unacceptable for large-scale batch jobs.
For option 2, what if the JM fails over after some input splits have been
computed and sent off? We don't have a specific JM failover strategy now, so
this causes the job to restart completely. Pursuing this option leads to a
discussion of a JM failover strategy, that is, one where the restarted JM can
recover (reconcile) state from the previous one.
For option 3, there would be wider considerations about Source. Take the
two-input case below. Currently we read from the source in a blocking manner.
If we now compute the input splits in a single task and still use the blocking
approach, the downstream may get stuck waiting for one input while the other
input is ready to be read.
Src1 ----\
Src2---->Join
One way to solve this issue is to read from the source in a non-blocking
manner. Assume we introduce a method
{{boolean SourceFunction#next(Collector<T>)}}: when the downstream calls it,
the source sends its data to the collector and returns true. If no more data
remains, it returns false. This also lets the source read from files and
produce data asynchronously.
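A small sketch of the pull-style {{next}} idea, assuming hypothetical types
throughout: {{SketchCollector}}, {{PullSource}}, {{IteratorSource}} and
{{TwoInputDriver}} are invented for illustration and are not part of Flink's
{{SourceFunction}} API. The driver alternates between the two inputs, so it
never waits on one input while the other still has data.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified collector; not Flink's Collector.
interface SketchCollector<T> {
    void collect(T record);
}

// Hypothetical source shaped after the proposed boolean next(Collector<T>).
interface PullSource<T> {
    // Emits one record and returns true, or returns false when no data remains.
    boolean next(SketchCollector<T> out);
}

// Backs a source with an in-memory iterator, purely for the example.
final class IteratorSource<T> implements PullSource<T> {
    private final Iterator<T> records;

    IteratorSource(Iterator<T> records) {
        this.records = records;
    }

    @Override
    public boolean next(SketchCollector<T> out) {
        if (!records.hasNext()) {
            return false;
        }
        out.collect(records.next());
        return true;
    }
}

final class TwoInputDriver {

    // Polls both inputs in turn until each has reported that it is exhausted.
    static <T> List<T> drain(PullSource<T> left, PullSource<T> right) {
        List<T> seen = new ArrayList<>();
        SketchCollector<T> out = seen::add;
        boolean leftOpen = true;
        boolean rightOpen = true;
        while (leftOpen || rightOpen) {
            if (leftOpen) {
                leftOpen = left.next(out);
            }
            if (rightOpen) {
                rightOpen = right.next(out);
            }
        }
        return seen;
    }
}
```

With a plain boolean, "nothing available right now" and "exhausted" cannot be
told apart, so a real design would probably need a richer return value; the
sketch sidesteps this by using sources that always have their data ready.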
To sum up, focusing on batch jobs, the main concerns would be JM failover for
options 1 and 2 (plus batch checkpoints, which are outside this issue but
significant), and a more flexible source for option 3.
> Parallel the creation of InputSplit if necessary
> ------------------------------------------------
>
> Key: FLINK-10038
> URL: https://issues.apache.org/jira/browse/FLINK-10038
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Coordination
> Affects Versions: 1.5.0
> Reporter: 陈梓立
> Priority: Major
> Labels: improvement, inputformat, parallel, perfomance
>
> As a continuation of the discussion in the PR about parallelizing the
> creation of ExecutionJobVertex [here|https://github.com/apache/flink/pull/6353],
> [~StephanEwen] suggested that we could parallelize the creation of
> InputSplit, from which we would gain performance improvements.