[
https://issues.apache.org/jira/browse/FLINK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-27338:
-----------------------------------
Labels: pull-request-available (was: )
> Improve splitting file for Hive source
> --------------------------------------
>
> Key: FLINK-27338
> URL: https://issues.apache.org/jira/browse/FLINK-27338
> Project: Flink
> Issue Type: Sub-task
> Components: Connectors / Hive
> Reporter: luoyuxia
> Assignee: luoyuxia
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> Currently, the Hive source uses the HDFS block size configured with the key
> dfs.block.size in hdfs-site.xml as the max split size when splitting files.
> The default value is usually 128M/256M depending on the configuration.
> This splitting strategy is not reasonable: the number of splits tends to be
> small, so the job can't make good use of parallel computing.
> What's more, when parallelism inference is enabled for the Hive source, the
> parallelism of the Hive source is set to the number of splits when it's not
> bigger than the max parallelism. So a small number of splits will limit the
> source parallelism and could degrade performance.
> To solve this problem, the idea is to calculate a reasonable split size based
> on the files' total size, the block size, and the default parallelism or the
> parallelism configured by the user.
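> The calculation described above could be sketched roughly as follows. This is
> a hypothetical illustration, not the actual Flink implementation; the class
> and method names are invented for the example.
>
> ```java
> public class SplitSizeCalculator {
>     private static final long ONE_MB = 1L << 20;
>
>     /**
>      * Hypothetical sketch: pick a split size so that there is at least one
>      * split per parallel task, capped by the HDFS block size (splits larger
>      * than a block would cross block boundaries) and floored at 1 MB so that
>      * small files don't produce a huge number of tiny splits.
>      */
>     public static long calculateSplitSize(long totalFileSize, long blockSize,
>                                           int parallelism) {
>         // Target: roughly one split per parallel task.
>         long sizePerTask = (long) Math.ceil((double) totalFileSize / parallelism);
>         // Clamp into [1 MB, blockSize].
>         return Math.max(ONE_MB, Math.min(blockSize, sizePerTask));
>     }
> }
> ```
>
> For example, with 1 GB of files, a 128 MB block size, and parallelism 32, this
> yields 32 MB splits (32 splits) instead of the 8 splits the block-size-based
> strategy would produce.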
--
This message was sent by Atlassian Jira
(v8.20.10#820010)