[jira] [Commented] (FLINK-14676) Introduce parallelism inference for InputFormatTableSource

Jingsong Lee (Jira) Mon, 11 Nov 2019 20:01:09 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972044#comment-16972044
 ]


Jingsong Lee commented on FLINK-14676:
--------------------------------------

Hi [~jark], thanks for you explain.

> let connectors to configure parallelism itself

First, InputFormatTableSource can not configure parallelism itself. Second, 
source can not get statistics, only catalog table contains statistics. Third, 
if we let connectors to configure parallelism itself, we need introduce 
configuration to connectors too, connector may have different config options.

> why the parallelism is inferred by row_count/rows_per_partition? 

Consider auto infer, is there any better indicator? Considering the CPU 
consumption of the process, In the SQL processing scenario, more CPUs are based 
on per record.

> What if the rowCount is empty or wrong?

rowCount of input format table source should be accurate, if rowCount is empty, 
there should be a way to find out this situation. CC: [~godfreyhe] 

Another thing, there is a switch: infer.mode.

> How to guarantee each partition process such number of rows?

Why we need guarantee? There is nothing need to be guaranteed, consider input 
splits, partitions will spend as much as possible according to their own 
processing input splits.

> partitioned source

partitioned source need update statistic after partition pruning.

> This may diverge stream and batch.

I don't think this can unify stream and batch, the statistics of stream source 
are totally different from batch. It is hard to say there is a row count for 
streaming source. 

> exposing configuration to set source parallelism directly

I think it is not a good way,  so many sources in graph, it is hard to let user 
to do it. If you want to discuss it, infer.mode is introduced to open/close 
infer.

> Introduce parallelism inference for InputFormatTableSource
> ----------------------------------------------------------
>
>                 Key: FLINK-14676
>                 URL: https://issues.apache.org/jira/browse/FLINK-14676
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table SQL / Planner
>            Reporter: Jingsong Lee
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> FLINK-12801 has introduce parallelism setting for table, but because 
> TableSource generate DataStream, maybe DataStream is not a real source, that 
> will lead to some shuffle errors. So FLINK-13494 remove these implementations.
> In this ticket, I would like to introduce parallelism inference only for 
> InputFormatTableSource, the RowCount of InputFormatTableSource is more 
> accurate than downstream stages. It is worth to automatically generate its 
> parallelism.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-14676) Introduce parallelism inference for InputFormatTableSource

Reply via email to