vinothchandar commented on issue #1757:
URL: https://github.com/apache/hudi/issues/1757#issuecomment-648478687


   > "hoodie.[insert|upsert|bulkinsert].shuffle.parallelism such that it's at least input_data_size/500MB"
   
   The reason for this rule was the 2GB limitation in Spark shuffle.. I see you are on Spark 2.4, which should get rid of this anyway. 
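   For reference, the arithmetic behind that rule of thumb can be sketched as follows (the helper name is mine, not a Hudi API):

```python
# Illustrative sizing helper (not a Hudi API): pick a parallelism so that each
# shuffle partition handles roughly 500 MB of input data.
import math

TARGET_PARTITION_BYTES = 500 * 1024 * 1024  # 500 MB

def suggested_parallelism(input_bytes: int) -> int:
    # Round up so no partition exceeds the target size; never return 0.
    return max(1, math.ceil(input_bytes / TARGET_PARTITION_BYTES))
```

   For the 5.5TB shuffle in this issue, that works out to roughly 11.5K partitions, in the same ballpark as the 10K suggested below.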
   
   Still worth trying to increase the parallelism to 10K, maybe.. it will ensure that the memory needed for each partition is lower (Spark's own data structures).. Also notice how much data it's spilling during the shuffle.. In all your runs, GC time is very high: in stage 4, for e.g, the 75th percentile task spends 45 mins out of 3.5h in GC. Consider giving a larger heap and tuning GC a bit? 
   https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide has a sample to try.. 
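   A minimal sketch of the knobs being discussed (the keys are the real Spark/Hudi config names; the values are assumptions to adapt per cluster, not prescriptions):

```python
# Illustrative tuning settings, e.g. to pass when building a SparkSession or
# as Hudi write options. Values here are assumptions for a cluster this size.
tuning_conf = {
    "spark.executor.memory": "20g",                     # larger heap, per the GC advice
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",  # G1 often shortens long GC pauses
    "hoodie.insert.shuffle.parallelism": "10000",
    "hoodie.upsert.shuffle.parallelism": "10000",
    "hoodie.bulkinsert.shuffle.parallelism": "10000",
}
```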
   
   Looking at stage 4: 
   it's the actual write.. the line [here](https://github.com/apache/hudi/blob/release-0.5.3/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L262) triggers the DAG and counts the errors during the write.. The workload distribution there is based on Spark sort's range partitioner, which seems to do a reasonable job (2.5x swing between min and max records).. Guess it is what it is.. The sorting here is useful down the line as you perform upserts.. e.g. if you have ordered keys, this pre-sort gives you very good index performance.
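   To see why the pre-sort pays off at upsert time, here is a toy model (not Hudi's actual index code): sorted writes give each file a tight, non-overlapping key range, so a lookup can be pruned to the single file whose min/max range contains the key.

```python
# Toy illustration of range-based file pruning. Each tuple is
# (file_name, min_key, max_key) for one file produced by a sorted write.
def candidate_files(ranges, key):
    # Only files whose [min, max] key range covers the key need to be probed.
    return [f for (f, lo, hi) in ranges if lo <= key <= hi]

# With sorted, non-overlapping ranges, exactly one file matches per key.
sorted_layout = [("f1", "a", "f"), ("f2", "g", "m"), ("f3", "n", "z")]
```

   Without the sort, key ranges overlap across files and every upsert key may have to be checked against many files instead of one.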
   
   Looking at stage 6:
   Seems like Spark lost a partition in the meantime and thus recomputed the RDD from scratch.. You can see one max task taking up 3hrs there.. My guess is it's recomputing. If we make stage 4 more scalable, then I think this will also go away.. Since the shuffle uses local disk, I would also ensure the cluster is big enough to hold the data needed for the shuffle.. 
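   A back-of-envelope check on that last point (the node count and headroom factor below are assumptions for illustration, not from this thread):

```python
# Rough local-disk requirement per node for shuffle data, assuming the data
# spreads evenly and leaving headroom for spills and recomputation.
def min_disk_per_node_gb(shuffle_tb, num_nodes, headroom=2.0):
    return shuffle_tb * 1024 * headroom / num_nodes
```

   e.g. spreading this 5.5TB shuffle over a hypothetical 20 nodes with 2x headroom would need roughly 563GB of local disk per node.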
   
   Btw, all of these are Spark shuffle tuning concerns, not Hudi-specific per se.. and a 5.5TB shuffle is a good size :)  
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

