[ 
https://issues.apache.org/jira/browse/FLINK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luoyuxia updated FLINK-30695:
-----------------------------
    Description: 
After FLINK-29635, we introduced auto compaction for the Hive sink. But 
compaction may take a long time in batch AQE, because the parallelism that AQE 
infers for the operator that compacts files may be small.

In the current design for compacting files in the Hive sink, a coordinator 
operator collects all written files and decides which files should be merged 
into one file. It packs this information into a CompactUnit, which contains the 
paths of the files that should be merged into one file.

Then the coordinator operator passes each CompactUnit to the downstream compact 
operator, which does the actual compaction.
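
The grouping step described above can be sketched roughly as follows. This is 
only an illustration, not the Hive sink's actual code: the CompactUnit fields, 
the plan_compaction helper, the target_size parameter and the greedy packing 
strategy are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompactUnit:
    """Hypothetical stand-in for the control message sent downstream."""
    unit_id: int
    paths: List[str] = field(default_factory=list)


def plan_compaction(file_sizes: Dict[str, int], target_size: int) -> List[CompactUnit]:
    """Greedily group written files so that each group merges into one file.

    file_sizes maps a file path to its size in bytes (visited in insertion
    order). A group is closed as soon as adding the next file would push it
    past target_size; a single oversized file still gets a group of its own.
    """
    units: List[CompactUnit] = []
    current: List[str] = []
    current_size = 0
    for path, size in file_sizes.items():
        if current and current_size + size > target_size:
            units.append(CompactUnit(len(units), current))
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        units.append(CompactUnit(len(units), current))
    return units


if __name__ == "__main__":
    written = {"part-0": 40, "part-1": 40, "part-2": 40, "part-3": 100, "part-4": 10}
    for unit in plan_compaction(written, target_size=100):
        print(unit.unit_id, unit.paths)
```

Each CompactUnit carries only file paths, which is why the messages that reach 
the compact operator are tiny compared with the files it has to read and 
rewrite.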

The volume of data emitted by the coordinator is small because it only sends 
control messages, so AQE infers a small parallelism for the compact operator. 
But most of the work (reading files and writing a new file) is actually done by 
the compact operator. If the parallelism of the compact operator is small, 
compaction will take a long time.

Although the user can set the parallelism of the compact operator manually, 
this requires extra configuration, and the user may still find it hard to choose 
a proper parallelism.

Ideally, the parallelism of the compact operator should equal the number of 
final merged files, which can be decided by the coordinator operator. I think 
the AQE framework could provide some mechanism that lets the operator itself 
decide its parallelism.
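
To make the gap concrete, here is a small sketch that compares a parallelism 
derived from input data volume with one derived from the number of merged 
files. Both functions, their parameters and the numbers used (message size, 
bytes per task, maximum parallelism) are assumptions for illustration only; 
they are not the AQE framework's API or its actual inference logic.

```python
import math


def parallelism_from_data_volume(input_bytes: int, bytes_per_task: int,
                                 max_parallelism: int) -> int:
    """Volume-based inference: one task per bytes_per_task of input."""
    wanted = math.ceil(input_bytes / bytes_per_task)
    return max(1, min(max_parallelism, wanted))


def parallelism_from_compact_units(num_units: int, max_parallelism: int) -> int:
    """Proposed behaviour: one task per final merged file, capped at the max."""
    return max(1, min(max_parallelism, num_units))


if __name__ == "__main__":
    num_units = 200                    # final merged files planned by the coordinator
    control_bytes = num_units * 1024   # assumed ~1 KiB per control message
    by_volume = parallelism_from_data_volume(
        control_bytes, bytes_per_task=16 * 1024 * 1024, max_parallelism=128)
    by_units = parallelism_from_compact_units(num_units, max_parallelism=128)
    print(by_volume, by_units)
```

Under these assumed numbers the volume-based estimate collapses to a single 
task, while the file-count-based estimate uses the full allowed parallelism.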

 

  was:
In the current design for compacting files in the Hive sink, a coordinator 
operator collects all written files and decides which files should be merged 
into one file. It packs this information into a CompactUnit, which contains the 
paths of the files that should be merged into one file.

Then the coordinator operator passes each CompactUnit to the downstream compact 
operator, which does the actual compaction.

The volume of data emitted by the coordinator is small because it only sends 
control messages, so AQE infers a small parallelism for the compact operator. 
But most of the work (reading files and writing a new file) is actually done by 
the compact operator. If the parallelism of the compact operator is small, 
compaction will take a long time.

Ideally, the parallelism of the compact operator should equal the number of 
final merged files, which can be decided by the coordinator operator. I think 
the AQE framework could provide some mechanism that lets the operator decide 
its parallelism.

 


> Support to set parallelism for compact operator according to the number of 
> files in AQE.
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-30695
>                 URL: https://issues.apache.org/jira/browse/FLINK-30695
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Priority: Major
>
> After FLINK-29635, we introduced auto compaction for the Hive sink. But 
> compaction may take a long time in batch AQE, because the parallelism that 
> AQE infers for the operator that compacts files may be small.
> In the current design for compacting files in the Hive sink, a coordinator 
> operator collects all written files and decides which files should be merged 
> into one file. It packs this information into a CompactUnit, which contains 
> the paths of the files that should be merged into one file.
> Then the coordinator operator passes each CompactUnit to the downstream 
> compact operator, which does the actual compaction.
> The volume of data emitted by the coordinator is small because it only sends 
> control messages, so AQE infers a small parallelism for the compact operator. 
> But most of the work (reading files and writing a new file) is actually done 
> by the compact operator. If the parallelism of the compact operator is small, 
> compaction will take a long time.
> Although the user can set the parallelism of the compact operator manually, 
> this requires extra configuration, and the user may still find it hard to 
> choose a proper parallelism.
> Ideally, the parallelism of the compact operator should equal the number of 
> final merged files, which can be decided by the coordinator operator. I think 
> the AQE framework could provide some mechanism that lets the operator itself 
> decide its parallelism.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
