sanjaytva commented on issue #12116:
URL: https://github.com/apache/hudi/issues/12116#issuecomment-2474067362

   @ad1happy2go A similar issue happened to me as well in **_bulk_insert_** mode: when I tried to insert a 34 GB JSON input with 100 partitions into a Hudi table (approx. 5.5 GB as Parquet), I got a `java.lang.OutOfMemoryError: Java heap space`.
   
   
![image](https://github.com/user-attachments/assets/08b23193-f07b-4d27-8218-69bbe8fbe559)
   
   I am using `hoodie.bulkinsert.shuffle.parallelism = '10'`, `hoodie.write.markers.type = 'direct'`, and `hoodie.embed.timeline.server = 'false'`.
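   
   For reference, this is how I set those options at the session level in Spark SQL (a sketch, assuming Hudi's Spark SQL integration picks up session-level `SET` properties; option names are as listed above):
   
   ```sql
   -- Hudi write options applied for the session before the INSERT
   SET hoodie.bulkinsert.shuffle.parallelism = 10;
   SET hoodie.write.markers.type = direct;
   SET hoodie.embed.timeline.server = false;
   ```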
   
   As @dataproblems stated, when I started the job with less memory I got `Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1080.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)` (that is roughly 67.5 MiB of serialized results per task, so 16 tasks overshoot the 1024 MiB default). That is why I, too, increased the `spark.driver.maxResultSize` property; it looks like the whole dataset is somehow being transferred to driver memory.
   
   But as you pointed out, the number of tasks is low for me as well because only the 100 source partitions are read. Let me also try repartitioning the dataset to 200 and increasing `hoodie.bulkinsert.shuffle.parallelism` to 100.
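   
   In SQL-only form, that experiment would look something like this (a sketch; `hudi_tbl` and `source_json` are placeholder names, and the `REPARTITION` hint assumes Spark 2.4+):
   
   ```sql
   -- Raise the bulk-insert shuffle parallelism for this session
   SET hoodie.bulkinsert.shuffle.parallelism = 100;
   
   -- Repartition the source to 200 partitions before the write
   INSERT INTO hudi_tbl
   SELECT /*+ REPARTITION(200) */ * FROM source_json;
   ```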
   
   Any further suggestions/optimizations I should apply? In Spark SQL, bulk_insert is the only mode available by default for Hudi table creation. I know I can switch to the DataFrame APIs, but when staying in SQL, can we switch to a normal insert?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
