[GitHub] [iceberg] zhangjun0x01 opened a new pull request #1936: Flink : add parallelism optimize for IcebergTableSource

GitBox Mon, 14 Dec 2020 20:56:45 -0800


zhangjun0x01 opened a new pull request #1936:
URL: https://github.com/apache/iceberg/pull/1936



   When using Flink to query an Iceberg table, the source runs with Flink's 
default parallelism, regardless of how many data files the table has. Users do 
not know what parallelism to choose: setting it too high wastes resources, and 
setting it too low makes the query slow. This PR therefore adds parallelism 
inference.
   
   Inference is enabled by default, and the inferred parallelism equals the 
number of data files. Users can turn inference off manually. To keep a table 
with many data files from producing excessive parallelism, there is also a 
maximum inferred parallelism; when the inferred value exceeds that setting, 
the maximum is used instead.
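
   The rule described above can be sketched roughly as follows. This is only 
an illustration, not the code in this PR: the class name, method name, and 
parameters are hypothetical, and the floor of 1 for a table with no data files 
is an added assumption.

```java
// Hypothetical sketch of the inference rule; none of these names come from the PR.
final class ParallelismInferenceSketch {
    private ParallelismInferenceSketch() {}

    /**
     * Returns the parallelism to use for the source.
     *
     * @param inferEnabled        whether inference is turned on (on by default per the description)
     * @param defaultParallelism  Flink's default parallelism, used when inference is off
     * @param dataFileCount       number of data files in the table to be read
     * @param maxInferParallelism upper bound on the inferred value
     */
    static int inferParallelism(boolean inferEnabled,
                                int defaultParallelism,
                                int dataFileCount,
                                int maxInferParallelism) {
        if (!inferEnabled) {
            // Inference turned off by the user: fall back to the default.
            return defaultParallelism;
        }
        // One parallel task per data file, capped by the configured maximum.
        int inferred = Math.min(dataFileCount, maxInferParallelism);
        // Assumption: never go below 1, e.g. for an empty table.
        return Math.max(1, inferred);
    }
}
```

   With a maximum of 100, a table with 10 data files would get a parallelism 
of 10, while a table with 5000 data files would be capped at 100.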
   
   In addition, when a limit is pushed down, the inferred parallelism must be 
compared with the limit in the `select` statement to get a more appropriate 
value. For example, for the SQL `select * from table limit 1`, the inferred 
parallelism may be 10, but a parallelism of one is enough because only one row 
is needed.
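
   A minimal sketch of the limit comparison, under stated assumptions: the 
class and method names are hypothetical, a negative limit is assumed to mean 
that no limit was pushed down, and the result is assumed to never drop below 1. 
The actual change in this PR may handle these cases differently.

```java
// Hypothetical sketch of the limit adjustment; none of these names come from the PR.
final class LimitAwareParallelismSketch {
    private LimitAwareParallelismSketch() {}

    /**
     * Lowers an already inferred parallelism when a pushed-down limit is smaller.
     *
     * @param inferredParallelism parallelism inferred from the data files
     * @param limit               row limit from the query; negative means "no limit" (assumption)
     */
    static int applyLimit(int inferredParallelism, long limit) {
        if (limit < 0) {
            // No limit pushed down: keep the inferred value.
            return inferredParallelism;
        }
        // There is no point in running more parallel readers than rows requested.
        long capped = Math.min((long) inferredParallelism, limit);
        // Assumption: keep at least one reader, even for "limit 0".
        return (int) Math.max(1L, capped);
    }
}
```

   For the `select * from table limit 1` example, `applyLimit(10, 1)` yields a 
parallelism of 1.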


