vinothchandar commented on issue #1757: URL: https://github.com/apache/hudi/issues/1757#issuecomment-648478687
> " hoodie.[insert|upsert|bulkinsert].shuffle.parallelism such that its atleast input_data_size/500MB The reason for this was the 2GB limitation in Spark shuffle.. I see you are on Spark 2.4, which should work get rid of this anyway. Still worth trying to increase the parallelism to 10K , may be.. it will ensure that the memory needed for each partition is lower (spark's own datastructures).. Also notice how much data it's shuffle spilling.. In all your runs, GC time is very high.. stage 4 75th percentile, for e.g 45 mins out of 3.5h. Consider giving large heap and tune gc bit? https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide has a sample to try.. Looking at stage 4: it's the actual write.. line [here](https://github.com/apache/hudi/blob/release-0.5.3/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L262) triggers the DAG and counts the errors during write.. The workload distribution there is based on Spark sort's range partitioner.. seems like it does a reasonable job (2.5x swing between min and max records).. Guess it is what it is.. the sorting here is useful down the line as you perform upserts.. e.g if you have ordered keys , this pre-sort gives you very good index performance. Looking at stage 6: Seems like Spark lost a partition in the mean time and thus recomputed the RDD from scratch.. You can see one max task taking up 3hrs there.. My guess is its recomputing. if we make stage 4 more scalable, then I think this will also go away IMO.. Since shuffle uses local disk, I will also ensure cluster is big enough to hold the data needed for shuffle.. Btw all these are spark shuffle tuning, not Hudi specific per se.. and a 5.5TB shuffle is a good size :) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
