[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

Udit Mehrotra (Jira) Thu, 19 Mar 2020 18:47:06 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063033#comment-17063033
 ]


Udit Mehrotra commented on HUDI-724:
------------------------------------

Thanks Feichi for putting this out! [~vinoth] [~vbalaji] Feichi is working on 
adopting Hudi for one of the AWS teams. He is seeing a significant performance 
improvement from parallelizing the listing of small files. We did turn on the 
embedded timeline server, but it did not help much, probably because the 
cache is not yet populated on the first call? We would like to get your 
thoughts on whether this parallelization is safe to do.

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
>                 Key: HUDI-724
>                 URL: https://issues.apache.org/jira/browse/HUDI-724
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>          Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between Spark stages. Tracking down 
> where the time was spent on the Spark driver showed that it was the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the load is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the Spark driver process, all the worker nodes sit idle waiting for 
> tasks.
> These partitions don't affect each other, so the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, create an RDD for the 
> partitions, and eventually send the get-small-files operations to multiple 
> tasks.
>  
> Screenshots attached for:
> - the gap without the improvement (gap.png)
> - the Spark stage with the improvement, no gap (nogapAfterImprovement.png)
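
To make the idea in the description concrete, here is a minimal sketch of the 
two shapes being compared: a sequential loop over partitions versus 
independent per-partition lookups run in parallel. It is not the code from the 
pull request. It uses a plain Java parallel stream in place of a Spark RDD so 
that it runs with no dependencies, and every name in it (SmallFilesSketch, 
getSmallFiles, the fake file names) is hypothetical. In the actual change the 
partition paths would be distributed through the JavaSparkContext instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; not Hudi's UpsertPartitioner.
class SmallFilesSketch {

    // Stand-in for the per-partition lookup. In a real table this would
    // list files in storage, which is the slow part.
    static List<String> getSmallFiles(String partitionPath) {
        return Arrays.asList(partitionPath + "/small-file-1",
                             partitionPath + "/small-file-2");
    }

    // Sequential shape: one lookup after another on a single thread,
    // comparable to the for-loop on the driver described in the issue.
    static Map<String, List<String>> sequential(List<String> partitionPaths) {
        Map<String, List<String>> result = new HashMap<>();
        for (String path : partitionPaths) {
            result.put(path, getSmallFiles(path));
        }
        return result;
    }

    // Parallel shape: the lookups are independent, so they can be fanned
    // out. distinct() avoids duplicate keys, which toMap would reject.
    static Map<String, List<String>> parallel(List<String> partitionPaths) {
        return partitionPaths.parallelStream()
                .distinct()
                .collect(Collectors.toMap(Function.identity(),
                                          SmallFilesSketch::getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitions =
                Arrays.asList("2020/03/17", "2020/03/18", "2020/03/19");
        // Both shapes should produce the same partition-to-files mapping.
        System.out.println(sequential(partitions).equals(parallel(partitions)));
    }
}
```

The point of the sketch is only that the result is the same map either way, 
because no lookup depends on another. Whether the real lookup is safe to run 
from multiple tasks is the open question raised in the comment above.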



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
