[ 
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luoyuxia updated FLINK-30695:
-----------------------------
    Description: 
In the current design for file compaction in the Hive sink, a coordinator
operator collects all written files and decides which of them should be merged
into a single file. It packs this information into a CompactUnit, which
contains the paths of the files to be merged into one file.

The coordinator operator then passes each CompactUnit to the downstream compact
operator, which performs the actual compaction.

The volume of data emitted by the coordinator is small because it only sends
control messages. Since AQE derives an operator's parallelism from the volume
of data it consumes, the compact operator ends up with a small parallelism.
However, most of the work (reading the files and writing a new file) is done
by the compact operator. If its parallelism is small, compaction takes a long
time.

Ideally, the parallelism of the compact operator should equal the number of
final merged files, which the coordinator operator can determine. I think the
AQE framework could provide some mechanism that lets an operator decide its
own parallelism.

 

> Support to set parallelism for compact operator according to the number of 
> files in AQE.
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-30695
>                 URL: https://issues.apache.org/jira/browse/FLINK-30695
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Priority: Major



