[
https://issues.apache.org/jira/browse/FLINK-12002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801504#comment-16801504
]
Jeff Zhang edited comment on FLINK-12002 at 3/26/19 9:17 AM:
-------------------------------------------------------------
This is a very useful feature, specially for session mode where there will be
multiple jobs and some jobs may be large while some may be small. Looking
forward the design.
was (Author: zjffdu):
This is a very useful features, specially for session mode where there will be
multiple jobs and some jobs may be large while some may be small. Looking
forward the design.
> Adaptive Parallelism of Job Vertex Execution
> --------------------------------------------
>
> Key: FLINK-12002
> URL: https://issues.apache.org/jira/browse/FLINK-12002
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Operators
> Reporter: ryantaocer
> Assignee: BoWang
> Priority: Major
>
> In Flink the parallelism of job is a pre-specified parameter, which is
> usually an empirical value and thus might not be optimal for both performance
> and resource depending on the amount of data processed in each task.
> Furthermore, a fixed parallelism cannot scale to varying data size common in
> production cluster where we may not often change configurations.
> We propose to determine the job parallelism adaptive to the actual total
> input data size and an ideal data size processed by each task. The ideal size
> is pre-specified according to the properties of the operator such as the
> preparation overhead compared with data processing time.
> Our basic idea of "split and merge" is to make the data dispatched evenly
> acorss Reducers by spliting and/or merging data buckets produced by Map. The
> data density skew problem is not covered. This kind of parallelism adjustment
> doesn't have data correctness issue since it doesnt' break the condition that
> data with the same key is processed by a single task. We determine the
> proper parallelism of Reduce during scheduling before its actual running and
> after its input been ready though not necessary total input data. In such
> context, apdative parallelism is a better name. This scheduling improvement
> we think can benefit both batch and stream as long as we can obtain some
> clues about the input data.
>
> detailed design doc coming soon.
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)