Caizhi Weng created FLINK-19641:
-----------------------------------

             Summary: Optimize parallelism calculating of HiveTableSource by 
checking file number
                 Key: FLINK-19641
                 URL: https://issues.apache.org/jira/browse/FLINK-19641
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / Hive
            Reporter: Caizhi Weng
             Fix For: 1.12.0


The current implementation of {{HiveTableSource#calculateParallelism}} directly 
uses {{inputFormat.createInputSplits(0).length}} as the number of splits. 
However, {{createInputSplits}} may be costly, as it reads some data from every 
source file, especially when the table is not partitioned and the number of 
files is large.

Many Hive tables maintain the number of files in the table's statistics, and 
the number of splits is clearly at least the number of files. So we can first 
fetch the number of files (at almost no cost), and if it already reaches the 
maximum parallelism, we can directly use the maximum parallelism without 
calling {{createInputSplits}}.
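The check above can be sketched as follows. This is only an illustration of 
the idea, not the actual Flink/Hive API: the interface and method names 
({{SplitSource}}, {{getFileCount}}, {{createInputSplitsCount}}) are assumptions.

```java
// Sketch of the proposed optimization. All names here are hypothetical,
// not the real HiveTableSource API.
public class ParallelismSketch {

    interface SplitSource {
        long getFileCount();          // cheap: read from table statistics
        int createInputSplitsCount(); // costly: touches every source file
    }

    static int calculateParallelism(SplitSource source, int maxParallelism) {
        // The number of splits is at least the number of files, so a file
        // count reaching maxParallelism already determines the answer.
        if (source.getFileCount() >= maxParallelism) {
            return maxParallelism;
        }
        // Otherwise fall back to the exact (expensive) split enumeration.
        return Math.min(source.createInputSplitsCount(), maxParallelism);
    }
}
```

With this shape, a table with 15000 files and a maximum parallelism of, say, 
128 returns immediately without touching any file.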

This is a significant optimization for the current Flink TPC-DS benchmark, 
which creates some unpartitioned tables with 15000 files. This optimization 
improves the performance of the whole benchmark by more than 300 seconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
