[ 
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luoyuxia updated FLINK-30695:
-----------------------------
        Parent:     (was: FLINK-29635)
    Issue Type: Improvement  (was: Sub-task)

> Support to set parallelism for compact operator according to the number of 
> files in AQE.
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-30695
>                 URL: https://issues.apache.org/jira/browse/FLINK-30695
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Priority: Major
>
> In the current design for compacting files in the Hive sink, a coordinator 
> operator collects all written files and decides which files should be 
> merged into a single file. It packs this information into a CompactUnit, 
> which contains the paths of the files to be merged into one file.
> The coordinator operator then passes each CompactUnit to the downstream 
> compact operator, which performs the actual compaction.
> The volume of data emitted by the coordinator is small because it only 
> sends control messages, which causes the parallelism of the compact 
> operator to be small under AQE. In fact, most of the work (reading files 
> and writing a new file) is done by the compact operator. If the parallelism 
> of the compact operator is small, compaction takes a long time.
> Ideally, the parallelism of the compact operator should equal the number 
> of final merged files, which can be determined by the coordinator 
> operator. I think the AQE framework could provide a mechanism that lets 
> an operator decide its own parallelism.
>  
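
To illustrate the idea in the description, here is a minimal sketch in plain
Java. It is not Flink code and does not use any Flink or AQE API. The class
CompactPlanSketch, the simplified CompactUnit, the greedy size-based packing
rule, the targetSize parameter, and the clamp to [1, maxParallelism] are all
assumptions made for illustration only. The point it shows is that once the
coordinator has grouped files into units, the number of units is a natural
candidate for the compact operator's parallelism.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not Flink's actual coordinator or AQE API. */
public class CompactPlanSketch {

    /** Simplified stand-in for CompactUnit: the paths to merge into one file. */
    public static final class CompactUnit {
        public final List<String> paths;

        public CompactUnit(List<String> paths) {
            this.paths = paths;
        }
    }

    /**
     * Greedily packs files, in iteration order, into units whose total size
     * stays at or below targetSize bytes (assumed packing rule). A single
     * file larger than targetSize still gets a unit of its own.
     */
    public static List<CompactUnit> plan(Map<String, Long> fileSizes, long targetSize) {
        List<CompactUnit> units = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0L;
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            // Close the current unit if adding this file would exceed the target.
            if (!current.isEmpty() && currentSize + e.getValue() > targetSize) {
                units.add(new CompactUnit(current));
                current = new ArrayList<>();
                currentSize = 0L;
            }
            current.add(e.getKey());
            currentSize += e.getValue();
        }
        if (!current.isEmpty()) {
            units.add(new CompactUnit(current));
        }
        return units;
    }

    /** One subtask per merged file, clamped to [1, maxParallelism] (assumption). */
    public static int suggestedParallelism(List<CompactUnit> units, int maxParallelism) {
        return Math.max(1, Math.min(units.size(), maxParallelism));
    }

    public static void main(String[] args) {
        // File names and sizes are made up for the example.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("part-0", 40L);
        files.put("part-1", 50L);
        files.put("part-2", 30L);
        files.put("part-3", 90L);
        files.put("part-4", 20L);
        List<CompactUnit> units = plan(files, 100L);
        System.out.println("units=" + units.size()
                + " parallelism=" + suggestedParallelism(units, 128));
    }
}
```

In this sketch the parallelism is derived from the coordinator's plan rather
than from the volume of data it emits, which is the gap the issue describes.
How such a value would actually be handed to the scheduler is left open here,
as it is in the issue.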



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
