[jira] [Commented] (FLINK-14676) Introduce parallelism inference for InputFormatTableSource

Jark Wu (Jira) Mon, 11 Nov 2019 19:06:50 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972014#comment-16972014
 ]


Jark Wu commented on FLINK-14676:
---------------------------------

Hi [~lzljs3620320], I'm not suggesting that the framework set the parallelism 
of the DataStream produced by a StreamTableSource. I suggested letting 
connectors configure their parallelism themselves. I don't know whether that 
solves your problem, because you didn't mention the background requirement in 
the JIRA. What I want to avoid is introducing temporary APIs. 

As you said, the parallelism inference framework was reverted before because 
we missed something. IMO, parallelism inference is a big topic and should be 
designed thoroughly. I also have some questions about the parallelism 
inference for InputFormatTableSource (according to the PR):
1) Why is the parallelism inferred by row_count/rows_per_partition? What if 
the rowCount is missing or wrong? How do we guarantee that each partition 
processes that number of rows? And what if it is not a partitioned source? 
2) The configuration is not applied in streaming mode. This may cause stream 
and batch to diverge.
3) I think a more intuitive way is to expose a configuration option that sets 
the source parallelism directly. How would it work together with the 
rows_per_partition configuration?
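
To make questions 1) and 3) concrete, here is a rough sketch of one way the two settings could interact. It is only an illustration under assumptions of mine, not what the PR does: the class name, the method, and all parameters (explicitParallelism, defaultParallelism, maxParallelism) are hypothetical and do not exist in Flink. In the sketch, an explicitly configured source parallelism always wins, a missing or non-positive row count falls back to a default, and otherwise the value is ceil(row_count/rows_per_partition) capped at a maximum.

```java
import java.util.OptionalInt;
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; none of these names exist in Flink. */
final class SourceParallelismSketch {

    private SourceParallelismSketch() {}

    /**
     * Resolves a source parallelism from an optional explicit setting and an
     * optional row-count estimate.
     *
     * @param explicitParallelism user-configured source parallelism, if any (assumed option)
     * @param rowCount            estimated row count from statistics, if known
     * @param rowsPerPartition    target number of rows per parallel instance, must be > 0
     * @param defaultParallelism  fallback when nothing can be inferred, must be >= 1
     * @param maxParallelism      upper bound for the inferred value, must be >= 1
     */
    static int resolve(
            OptionalInt explicitParallelism,
            OptionalLong rowCount,
            long rowsPerPartition,
            int defaultParallelism,
            int maxParallelism) {
        if (rowsPerPartition <= 0 || defaultParallelism < 1 || maxParallelism < 1) {
            throw new IllegalArgumentException("invalid parallelism configuration");
        }

        // An explicit setting takes precedence over any inference.
        if (explicitParallelism.isPresent()) {
            int explicit = explicitParallelism.getAsInt();
            if (explicit < 1) {
                throw new IllegalArgumentException("explicit parallelism must be >= 1");
            }
            return explicit;
        }

        // Missing or non-positive statistics: do not guess, use the default.
        if (!rowCount.isPresent() || rowCount.getAsLong() <= 0) {
            return defaultParallelism;
        }

        // Ceiling division written to avoid overflow for very large row counts.
        long rows = rowCount.getAsLong();
        long inferred = rows / rowsPerPartition;
        if (rows % rowsPerPartition != 0) {
            inferred++;
        }

        // Clamp the inferred value into [1, maxParallelism].
        return (int) Math.max(1L, Math.min(inferred, (long) maxParallelism));
    }
}
```

Note that this sketch cannot tell whether a row count is wrong, only whether it is missing, and it does not guarantee that each parallel instance really reads rows_per_partition rows, so those parts of question 1) remain open.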





> Introduce parallelism inference for InputFormatTableSource
> ----------------------------------------------------------
>
>                 Key: FLINK-14676
>                 URL: https://issues.apache.org/jira/browse/FLINK-14676
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table SQL / Planner
>            Reporter: Jingsong Lee
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> FLINK-12801 introduced a parallelism setting for tables, but because a 
> TableSource generates a DataStream, and that DataStream may not be a real 
> source, this can lead to shuffle errors. So FLINK-13494 removed that 
> implementation.
> In this ticket, I would like to introduce parallelism inference only for 
> InputFormatTableSource, since the RowCount of an InputFormatTableSource is 
> more accurate than that of downstream stages. It is worth generating its 
> parallelism automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
