[
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682380#comment-17682380
]
luoyuxia commented on FLINK-30695:
----------------------------------
Maybe [~wanglijie] [~zhuzh] can have a look.
> Support to set parallelism for compact operator according to the number of
> files in AQE.
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-30695
> URL: https://issues.apache.org/jira/browse/FLINK-30695
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Hive
> Reporter: luoyuxia
> Priority: Major
>
> After FLINK-29635, we introduced auto compaction for the Hive sink. But
> compaction may take a long time in batch AQE, since the parallelism that AQE
> infers for the operator that compacts files may be small.
> In the current design for compacting files in the Hive sink, there's a
> coordinator operator that collects all written files and decides which files
> should be merged into one file. It packs this information into a CompactUnit,
> which contains the paths of the files that should be merged into one file.
> Then, the coordinator operator passes each CompactUnit to the downstream
> compact operator, which does the actual compaction.
> The volume of data emitted by the coordinator is small because it only sends
> control messages, which causes AQE to infer a small parallelism for the
> compact operator. But actually, most of the work (reading files and writing a
> new file) is done by the compact operator. If the parallelism of the compact
> operator is small, compaction will take a long time.
> Although the user can set the parallelism of the compact operator manually,
> this requires extra configuration, and the user may still find it hard to
> choose a proper parallelism.
> Ideally, the parallelism of the compact operator should equal the number of
> final merged files, which can be decided by the coordinator operator. I think
> the AQE framework can provide some mechanism to let the operator itself decide
> its parallelism.
>
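To make the idea in the description concrete, here is a minimal sketch in plain Java with no Flink APIs. Every name in it (CompactPlanSketch, pack, desiredParallelism, targetBytes, maxParallelism) is hypothetical, and the greedy size-based packing is only an assumption for illustration: the issue does not say how the coordinator groups files, and this is not the Hive sink's actual code. It only shows that once the coordinator has built its compact units, the number of units is a natural parallelism for the compact operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; these are not Flink or Hive connector classes.
final class CompactPlanSketch {

    /** Stand-in for the CompactUnit in the issue: the paths to merge into one file. */
    static final class CompactUnit {
        final List<String> paths = new ArrayList<>();
        long totalBytes = 0;
    }

    /**
     * Assumed packing rule: walk the files in iteration order and start a new
     * unit whenever adding the next file would exceed targetBytes. A file that
     * is larger than targetBytes on its own still gets a unit to itself.
     */
    static List<CompactUnit> pack(Map<String, Long> fileSizes, long targetBytes) {
        List<CompactUnit> units = new ArrayList<>();
        CompactUnit current = null;
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long size = entry.getValue();
            if (current == null || current.totalBytes + size > targetBytes) {
                current = new CompactUnit();
                units.add(current);
            }
            current.paths.add(entry.getKey());
            current.totalBytes += size;
        }
        return units;
    }

    /**
     * One subtask per merged output file, clamped to [1, maxParallelism].
     * maxParallelism is a hypothetical upper bound, not a real config option.
     */
    static int desiredParallelism(List<CompactUnit> units, int maxParallelism) {
        return Math.max(1, Math.min(units.size(), maxParallelism));
    }
}
```

With something like this, the coordinator would call pack(...) on the files it collected and report desiredParallelism(...) to the scheduler. How that value would reach the AQE framework is exactly the mechanism the issue asks for, so it is left out here.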
--
This message was sent by Atlassian Jira
(v8.20.10#820010)