sanjaytva commented on issue #12116: URL: https://github.com/apache/hudi/issues/12116#issuecomment-2474067362
@ad1happy2go A similar issue happened to me as well in **_bulk_insert_** mode: when I tried to insert a 34 GB JSON input with 100 partitions into a Hudi table (approx. 5.5 GB parquet), I got a `java.lang.OutOfMemoryError: Java heap space`. I am using `hoodie.bulkinsert.shuffle.parallelism = '10'`, `hoodie.write.markers.type = 'direct'`, and `hoodie.embed.timeline.server = 'false'`.

As @dataproblems stated, when I started the job with less memory I got `Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1080.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)`, which is why I too increased the `spark.driver.maxResultSize` property. It looks like the whole dataset is somehow being transferred to driver memory.

As you pointed out, the number of tasks is low for me as well because only the 100 source partitions are read. Let me also try repartitioning the dataset to 200 and increasing `hoodie.bulkinsert.shuffle.parallelism` to 100 (rough sketch below).

Any further suggestions/optimizations I should apply? Since bulk_insert is the only mode available by default for Hudi table creation in Spark SQL, I know I can switch to the DataFrame APIs, but staying in SQL, can we switch to normal insert?
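For reference, here is roughly what I plan to try with the DataFrame API. This is only a sketch: the input path, table name, and base path are placeholders for my actual job, and the option values are the ones discussed above.

```scala
import org.apache.spark.sql.SaveMode

// Placeholder input path; the real job reads ~34 GB of JSON with 100 partitions.
val df = spark.read.json("s3://my-bucket/input/")
  .repartition(200) // spread the data across more, smaller tasks before the write

df.write.format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "100") // raised from 10
  .option("hoodie.write.markers.type", "direct")
  .option("hoodie.embed.timeline.server", "false")
  .option("hoodie.table.name", "my_table")                // placeholder name
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/my_table/")                  // placeholder base path
```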
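On the SQL question, this is my (possibly wrong) understanding of how the insert behavior can be switched away from bulk_insert without leaving Spark SQL; the exact config depends on the Hudi version, so please correct me if these don't apply here:

```scala
// Hudi 0.14+: choose the operation used by INSERT INTO directly
// (my assumption based on the config docs; values: insert / bulk_insert / upsert).
spark.sql("SET hoodie.spark.sql.insert.into.operation = insert")

// Older releases: disable the bulk-insert path for SQL writes instead.
spark.sql("SET hoodie.sql.bulk.insert.enable = false")

// Placeholder table and source names.
spark.sql("INSERT INTO my_table SELECT * FROM source_view")
```

If one of these is the intended way to get a normal insert from SQL, that would already help a lot.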
