Feichi Feng created HUDI-724:
--------------------------------
Summary: Parallelize GetSmallFiles For Partitions
Key: HUDI-724
URL: https://issues.apache.org/jira/browse/HUDI-724
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Performance, Writer Core
Reporter: Feichi Feng
Attachments: gap.png, nogapAfterImprovement.png
When writing data, a gap was observed between Spark stages. Tracking down
where the time was spent on the Spark driver showed it was the get-small-files
operation for partitions.
When creating the UpsertPartitioner and assigning insert records, a plain
for-loop is used to get the list of small files for every partition the write
is going to load data into. This is very slow when there are many partitions
to go through, and while the loop runs on the Spark driver process, all the
worker nodes sit idle waiting for tasks.
The per-partition lookups are independent of each other, so the
get-small-files operations can be parallelized. The change I made is to pass
the JavaSparkContext to the UpsertPartitioner, create an RDD of the
partitions, and distribute the get-small-files operations across multiple
tasks.
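A minimal sketch of the idea (not the actual patch): `getSmallFiles` below is a hypothetical stand-in for the per-partition scan, and a parallel stream is used to show the fan-out shape locally; the real change would use something like `jsc.parallelize(partitions).mapToPair(...).collectAsMap()` so the lookups run as Spark tasks instead of on the driver.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SmallFilesLookup {

    // Hypothetical stand-in for the per-partition small-file scan;
    // in Hudi this would consult the table's file-system view.
    static List<String> getSmallFiles(String partitionPath) {
        return List.of(partitionPath + "/small-file-0");
    }

    // Original shape: a sequential for-loop running entirely on the driver.
    static Map<String, List<String>> sequentialLookup(List<String> partitions) {
        Map<String, List<String>> smallFiles = new LinkedHashMap<>();
        for (String partition : partitions) {
            smallFiles.put(partition, getSmallFiles(partition));
        }
        return smallFiles;
    }

    // Improved shape: the independent lookups fanned out in parallel.
    // The actual improvement distributes this over an RDD of partitions
    // via the JavaSparkContext; a parallel stream shows the same pattern
    // without needing a Spark cluster.
    static Map<String, List<String>> parallelLookup(List<String> partitions) {
        return partitions.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Function.identity(),
                        SmallFilesLookup::getSmallFiles));
    }
}
```

Both versions return the same mapping; only where the work runs differs, which is what closes the idle gap between stages.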
--
This message was sent by Atlassian Jira
(v8.3.4#803005)