[ 
https://issues.apache.org/jira/browse/FLINK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972052#comment-16972052
 ] 

Jingsong Lee commented on FLINK-14676:
--------------------------------------

Let me give more background:

1. Our goal for 1.10 is to make batch SQL production-ready.

2. Unlike streaming jobs, batch jobs often have many dimension tables
(sources that need to be scanned), and these are often very small. If we keep
the current unified parallelism, a large number of shuffle files will be
generated, and the excess parallelism will seriously hurt performance. As a
result, we cannot set a high parallelism for the job.

3. I don't think this is merely an interim solution. Spark, Presto, and Hive
all set source parallelism separately, while their intermediate nodes usually
share a unified parallelism.
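
To illustrate points 2 and 3, here is a minimal sketch of inferring a source's
parallelism from its estimated row count, so that a small dimension table gets
few subtasks while the rest of the job keeps its unified parallelism. This is
only an illustration under assumptions: the class name, the ROWS_PER_SUBTASK
and MAX_SOURCE_PARALLELISM constants, and the method are hypothetical. They are
not Flink APIs and not necessarily what the pull request implements.

```java
public final class SourceParallelismSketch {

    // Hypothetical tuning knobs for illustration; not real Flink options.
    public static final long ROWS_PER_SUBTASK = 1_000_000L;
    public static final int MAX_SOURCE_PARALLELISM = 1000;

    private SourceParallelismSketch() {}

    /**
     * Infers a parallelism for a scan source from its estimated row count.
     * A negative row count means "unknown", in which case the unified
     * default parallelism is used.
     */
    public static int inferSourceParallelism(long estimatedRowCount,
                                             int defaultParallelism) {
        if (estimatedRowCount < 0) {
            return defaultParallelism;
        }
        // Ceiling division without risking overflow on very large counts.
        long needed = estimatedRowCount / ROWS_PER_SUBTASK
                + (estimatedRowCount % ROWS_PER_SUBTASK == 0 ? 0 : 1);
        // Use at least one subtask and never exceed the configured cap.
        long clamped = Math.max(1L, Math.min(needed, MAX_SOURCE_PARALLELISM));
        return (int) clamped;
    }

    public static void main(String[] args) {
        // Small dimension table: prints 1
        System.out.println(inferSourceParallelism(5_000L, 200));
        // Large fact table: prints 250
        System.out.println(inferSourceParallelism(250_000_000L, 200));
    }
}
```

With values like these, a 5,000-row dimension table is scanned by a single
subtask instead of 200, which is where the shuffle file savings would come
from.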

> Introduce parallelism inference for InputFormatTableSource
> ----------------------------------------------------------
>
>                 Key: FLINK-14676
>                 URL: https://issues.apache.org/jira/browse/FLINK-14676
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table SQL / Planner
>            Reporter: Jingsong Lee
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> FLINK-12801 introduced a parallelism setting for tables, but because a
> TableSource generates a DataStream, and that DataStream may not be a real
> source, this led to some shuffle errors. FLINK-13494 therefore removed these
> implementations.
> In this ticket, I would like to introduce parallelism inference only for
> InputFormatTableSource, because its RowCount is more accurate than that of
> downstream stages. It is worth inferring its parallelism automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
