[
https://issues.apache.org/jira/browse/FLINK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972044#comment-16972044
]
Jingsong Lee commented on FLINK-14676:
--------------------------------------
Hi [~jark], thanks for you explain.
> let connectors to configure parallelism itself
First, InputFormatTableSource can not configure parallelism itself. Second,
source can not get statistics, only catalog table contains statistics. Third,
if we let connectors to configure parallelism itself, we need introduce
configuration to connectors too, connector may have different config options.
> why the parallelism is inferred by row_count/rows_per_partition?
Consider auto infer, is there any better indicator? Considering the CPU
consumption of the process, In the SQL processing scenario, more CPUs are based
on per record.
> What if the rowCount is empty or wrong?
rowCount of input format table source should be accurate, if rowCount is empty,
there should be a way to find out this situation. CC: [~godfreyhe]
Another thing, there is a switch: infer.mode.
> How to guarantee each partition process such number of rows?
Why we need guarantee? There is nothing need to be guaranteed, consider input
splits, partitions will spend as much as possible according to their own
processing input splits.
> partitioned source
partitioned source need update statistic after partition pruning.
> This may diverge stream and batch.
I don't think this can unify stream and batch, the statistics of stream source
are totally different from batch. It is hard to say there is a row count for
streaming source.
> exposing configuration to set source parallelism directly
I think it is not a good way, so many sources in graph, it is hard to let user
to do it. If you want to discuss it, infer.mode is introduced to open/close
infer.
> Introduce parallelism inference for InputFormatTableSource
> ----------------------------------------------------------
>
> Key: FLINK-14676
> URL: https://issues.apache.org/jira/browse/FLINK-14676
> Project: Flink
> Issue Type: New Feature
> Components: Table SQL / Planner
> Reporter: Jingsong Lee
> Assignee: Jingsong Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.10.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> FLINK-12801 has introduce parallelism setting for table, but because
> TableSource generate DataStream, maybe DataStream is not a real source, that
> will lead to some shuffle errors. So FLINK-13494 remove these implementations.
> In this ticket, I would like to introduce parallelism inference only for
> InputFormatTableSource, the RowCount of InputFormatTableSource is more
> accurate than downstream stages. It is worth to automatically generate its
> parallelism.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)