vinothchandar commented on issue #1757: URL: https://github.com/apache/hudi/issues/1757#issuecomment-648478687
> " hoodie.[insert|upsert|bulkinsert].shuffle.parallelism such that its atleast input_data_size/500MB The reason for this was the 2GB limitation in Spark shuffle.. I see you are on Spark 2.4, which should work get rid of this anyway. Still worth trying to increase the parallelism to 10K , may be.. it will ensure that the memory needed for each partition is lower (spark's own datastructures).. Also notice how much data it's shuffle spilling.. In all your runs, GC time is very high.. stage 4 75th percentile, for e.g 45 mins out of 3.5h. Consider giving large heap and tune gc bit? https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide has a sample to try.. Looking at stage 4: it's the actual write.. line [here](https://github.com/apache/hudi/blob/release-0.5.3/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L262) triggers the DAG and counts the errors during write.. The workload distribution there is based on Spark sort's range partitioner.. seems like it does a reasonable job (2.5x swing between min and max records).. Guess it is what it is.. the sorting here is useful down the line as you perform upserts.. e.g if you have ordered keys , this pre-sort gives you very good index performance. Looking at stage 6: Seems like Spark lost a partition in the mean time and thus recomputed the RDD from scratch.. You can see one max task taking up 3hrs there.. My guess is its recomputing. if we make stage 4 more scalable, then I think this will also go away IMO.. Since shuffle uses local disk, I will also ensure cluster is big enough to hold the data needed for shuffle.. Btw all these are spark shuffle tuning, not Hudi specific per se.. and a 5.5TB shuffle is a good size :) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
