[ 
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682380#comment-17682380
 ] 

luoyuxia commented on FLINK-30695:
----------------------------------

Maybe [~wanglijie] [~zhuzh] can have a look. 

> Support to set parallelism for compact operator according to the number of 
> files in AQE.
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-30695
>                 URL: https://issues.apache.org/jira/browse/FLINK-30695
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Priority: Major
>
> After FLINK-29635, we introduced auto compaction for the Hive sink. But 
> compaction may take a long time in batch mode with AQE, since the parallelism 
> that AQE infers for the operator that compacts files may be small. 
> In the current design for compacting files in the Hive sink, there is a 
> coordinator operator that collects all written files and decides which files 
> should be merged into a single file. It packs this information into a 
> CompactUnit, which contains the paths of the files that should be merged into 
> one file.
> Then the coordinator operator passes each CompactUnit to the downstream 
> compact operator, which does the actual compaction.
> The volume of data emitted by the coordinator is small because it only sends 
> control messages, which causes AQE to infer a small parallelism for the 
> compact operator. But most of the work (reading files and writing a new 
> file) is actually done by the compact operator. If the parallelism of the 
> compact operator is small, compaction will take a long time.
> Although the user can set the parallelism of the compact operator 
> manually, this requires extra configuration, and the user may still find it 
> hard to choose a proper parallelism.
> Ideally, the parallelism of the compact operator should be equal to the 
> number of final merged files, which can be decided by the coordinator 
> operator. I think the AQE framework can provide some mechanism to let the 
> operator itself decide the parallelism.
>  
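
To make the proposal in the description concrete, here is a minimal, standalone sketch of the idea: a coordinator groups written files into compact units, and the number of units is then used as the parallelism of the compact operator. It does not use any Flink API. The class CompactParallelismSketch, the methods planUnits and decideParallelism, the greedy size-based packing, and the maxParallelism cap are all assumptions made for illustration; only the CompactUnit name and the "one subtask per merged file" rule come from the issue text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactParallelismSketch {

    /** Hypothetical stand-in for a CompactUnit: the paths to merge into one file. */
    static final class CompactUnit {
        final List<String> paths = new ArrayList<>();
        long totalBytes = 0;
    }

    /**
     * Greedily packs files, in iteration order, into units whose total size
     * stays within targetBytes. A file larger than targetBytes gets its own unit.
     * The packing strategy is an assumption, not the Hive sink's actual one.
     */
    static List<CompactUnit> planUnits(Map<String, Long> fileSizes, long targetBytes) {
        List<CompactUnit> units = new ArrayList<>();
        CompactUnit current = null;
        for (Map.Entry<String, Long> file : fileSizes.entrySet()) {
            // Open a new unit if none exists yet or the next file would overflow it.
            if (current == null || current.totalBytes + file.getValue() > targetBytes) {
                current = new CompactUnit();
                units.add(current);
            }
            current.paths.add(file.getKey());
            current.totalBytes += file.getValue();
        }
        return units;
    }

    /** One subtask per merged file, clamped to the range [1, maxParallelism]. */
    static int decideParallelism(int unitCount, int maxParallelism) {
        return Math.max(1, Math.min(unitCount, maxParallelism));
    }

    public static void main(String[] args) {
        // Hypothetical file names and sizes (in arbitrary units).
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("part-0", 40L);
        files.put("part-1", 50L);
        files.put("part-2", 30L);
        files.put("part-3", 90L);
        files.put("part-4", 10L);

        List<CompactUnit> units = planUnits(files, 100L);
        // Prints "units=3 parallelism=3".
        System.out.println("units=" + units.size()
                + " parallelism=" + decideParallelism(units.size(), 128));
    }
}
```

In this sketch the parallelism depends only on how many merged files are planned, not on the size of the control messages, which is the mismatch the description points out.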



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
