[
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
luoyuxia updated FLINK-30695:
-----------------------------
Description:
After FLINK-29635, we introduced auto compaction for the Hive sink. But
compaction may take a long time in batch AQE mode, since the parallelism that
AQE infers for the operator that compacts files may be small.
In the current design for compacting files in the Hive sink, there's a
coordinator operator that collects all written files and decides which files
should be merged into one file. It packs this information into a CompactUnit,
which contains the paths of the files that should be merged into one file.
Then, the coordinator operator passes the CompactUnit to the downstream compact
operator, which does the actual compaction.
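
The coordinator's packing step can be pictured with a small model. The sketch
below is not Flink code: `CompactUnit` here is a plain Python dataclass, and
`plan_compaction`, `target_size` and the greedy size-based grouping are
assumptions made for illustration, since the description does not say how
files are grouped.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CompactUnit:
    """Hypothetical stand-in: the paths that should be merged into one file."""
    paths: List[str] = field(default_factory=list)
    total_bytes: int = 0


def plan_compaction(files: List[Tuple[str, int]], target_size: int) -> List[CompactUnit]:
    """Greedily group (path, size) pairs into units of at most target_size bytes.

    A file larger than target_size ends up in a unit of its own.
    """
    units: List[CompactUnit] = []
    current = CompactUnit()
    for path, size in files:
        # Close the current unit if adding this file would exceed the target.
        if current.paths and current.total_bytes + size > target_size:
            units.append(current)
            current = CompactUnit()
        current.paths.append(path)
        current.total_bytes += size
    if current.paths:
        units.append(current)
    return units


# Made-up file names and sizes, only to exercise the sketch.
written = [("part-0", 40), ("part-1", 50), ("part-2", 30), ("part-3", 90), ("part-4", 10)]
units = plan_compaction(written, target_size=100)
for unit in units:
    print(unit.paths, unit.total_bytes)
```

Each unit in the result corresponds to one final merged file, so the length of
`units` is the number of independent merge tasks the downstream operator could
run in parallel.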
The volume of data emitted by the coordinator is small because it only sends
control messages, which causes AQE to infer a small parallelism for the
compact operator. But most of the work (reading files and writing a new file)
is actually done by the compact operator. If the parallelism of the compact
operator is small, compaction will take a long time.
Although the user can set the parallelism of the compact operator manually,
this requires extra configuration, and the user may still find it hard to
choose a proper parallelism.
Ideally, the parallelism of the compact operator should be equal to the number
of final merged files, which can be decided by the coordinator operator. I
think the AQE framework could provide some mechanism that lets the operator
itself decide its parallelism.
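
To make the mismatch concrete, here is a toy comparison of the two ways of
choosing the parallelism. Everything in it is assumed for illustration:
`parallelism_from_volume` models a volume-based rule as ceil(input bytes /
bytes per task), which is a simplification and not the actual AQE algorithm,
and the message size, per-task volume and upper bound are made-up numbers.

```python
import math


def parallelism_from_volume(input_bytes: int, bytes_per_task: int, max_parallelism: int) -> int:
    """Simplified volume-based rule: one task per bytes_per_task of input."""
    wanted = math.ceil(input_bytes / bytes_per_task)
    return max(1, min(wanted, max_parallelism))


def parallelism_from_units(num_units: int, max_parallelism: int) -> int:
    """Rule suggested in this issue: one task per final merged file, capped."""
    return max(1, min(num_units, max_parallelism))


num_units = 200                      # merged files planned by the coordinator
control_message_bytes = 256          # assumed size of one serialized CompactUnit
bytes_per_task = 16 * 1024 * 1024    # assumed per-task data volume target
max_parallelism = 128                # assumed upper bound

by_volume = parallelism_from_volume(
    num_units * control_message_bytes, bytes_per_task, max_parallelism
)
by_units = parallelism_from_units(num_units, max_parallelism)
print(by_volume, by_units)
```

With these numbers the control messages add up to about 50 KB, so the
volume-based rule picks a single task, while counting the planned merged files
would use every slot allowed by the cap.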
was:
In the current design for compacting files in the Hive sink, there's a
coordinator operator that collects all written files and decides which files
should be merged into one file. It packs this information into a CompactUnit,
which contains the paths of the files that should be merged into one file.
Then, the coordinator operator passes the CompactUnit to the downstream compact
operator, which does the actual compaction.
The volume of data emitted by the coordinator is small because it only sends
control messages, which causes AQE to infer a small parallelism for the
compact operator. But most of the work (reading files and writing a new file)
is actually done by the compact operator. If the parallelism of the compact
operator is small, compaction will take a long time.
Ideally, the parallelism of the compact operator should be equal to the number
of final merged files, which can be decided by the coordinator operator. I
think the AQE framework could provide some mechanism that lets the operator
decide the parallelism.
> Support to set parallelism for compact operator according to the number of
> files in AQE.
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-30695
> URL: https://issues.apache.org/jira/browse/FLINK-30695
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Hive
> Reporter: luoyuxia
> Priority: Major
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)