[ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071145#comment-17071145 ]

Feichi Feng commented on HUDI-724:
----------------------------------

PR is merged. 

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
>                 Key: HUDI-724
>                 URL: https://issues.apache.org/jira/browse/HUDI-724
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>          Time Spent: 40m
>  Remaining Estimate: 47h 20m
>
> When writing data, a gap was observed between Spark stages. Tracking down 
> where the time was spent on the Spark driver showed that it goes to the 
> get-small-files operation for partitions.
> When the UpsertPartitioner is created and tries to assign insert records, it 
> uses a plain for-loop to get the list of small files for every partition 
> that the write is going to load data into. This is very slow when there are 
> a lot of partitions to go through. While the operation runs in the Spark 
> driver process, all the worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations 
> can be parallelized. The change I made is to pass the JavaSparkContext to 
> the UpsertPartitioner and create an RDD from the partitions, so that the 
> get-small-files operations are distributed across multiple tasks.
>  
> Screenshots attached for:
> - the gap without the improvement
> - the Spark stage with the improvement (no gap)
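
The idea described above can be illustrated with a minimal, self-contained sketch. It is not the code from the merged PR: the class name SmallFilesSketch and the getSmallFiles helper are hypothetical, and a Java parallel stream stands in for the Spark RDD so that the example runs without a cluster. Under those assumptions, it contrasts the sequential per-partition loop with a version that fans the independent lookups out to multiple workers. In Spark, the analogous shape would presumably be to build an RDD from the partition paths with JavaSparkContext.parallelize, map each path to its small files, and collect the results back on the driver.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative and not taken from Hudi.
final class SmallFilesSketch {

    // Stand-in for the per-partition lookup. A real lookup would list files
    // on storage; here it returns two made-up file names for the partition.
    static List<String> getSmallFiles(String partitionPath) {
        return Arrays.asList(partitionPath + "/file-1", partitionPath + "/file-2");
    }

    // Before: one lookup after another on a single thread (the "driver").
    static Map<String, List<String>> sequential(List<String> partitionPaths) {
        Map<String, List<String>> result = new HashMap<>();
        for (String path : partitionPaths) {
            result.put(path, getSmallFiles(path));
        }
        return result;
    }

    // After: the lookups are independent, so they can run in parallel.
    // A parallel stream is used here in place of a Spark RDD.
    // Assumes partition paths are distinct (toMap rejects duplicate keys).
    static Map<String, List<String>> parallel(List<String> partitionPaths) {
        return partitionPaths.parallelStream()
                .collect(Collectors.toMap(path -> path, SmallFilesSketch::getSmallFiles));
    }
}
```

Both methods return the same map from partition path to small files. Only the scheduling differs, which is why the change can remove the gap between stages without altering the result of the assignment.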



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
