rubenssoto opened a new issue #2003:
URL: https://github.com/apache/hudi/issues/2003
Hi guys,
I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors.
Data size: 350 GB
Spark master: 4 CPUs, 16 GB RAM
Core nodes: 8 × r5.4xlarge = 16 vCPUs, 122 GB RAM each
**My spark-submit:**
`spark-submit --deploy-mode cluster --conf "spark.executor.extraJavaOptions
-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --conf spark.executor.cores=5
--conf spark.executor.memory=33g --conf spark.executor.memoryOverhead=2048
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.sql.hive.convertMetastoreParquet=false --packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
`
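As a sanity check on the executor sizing, here is a rough back-of-the-envelope calculation (assuming YARN can hand roughly the full 122 GB per node to containers):

```python
# Back-of-the-envelope executor sizing check (assumed values, not measured).
node_vcpus = 16     # r5.4xlarge
node_mem_gb = 122   # approximate memory available per node

executor_cores = 5  # spark.executor.cores=5
executor_mem_gb = 33  # spark.executor.memory=33g
overhead_gb = 2     # spark.executor.memoryOverhead=2048 (MB)

executors_per_node = node_vcpus // executor_cores                     # -> 3
mem_used_per_node = executors_per_node * (executor_mem_gb + overhead_gb)  # -> 105

print(executors_per_node, mem_used_per_node)  # 3 executors, 105 GB of 122 GB
```

So the configuration should fit: 3 executors per node using about 105 GB of the 122 GB available.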
My Hudi options:
```json
{
  "hoodie.datasource.write.recordkey.field": "id",
  "hoodie.table.name": "stockout",
  "hoodie.datasource.write.table.name": "stockout",
  "hoodie.datasource.write.operation": "bulk_insert",
  "hoodie.datasource.write.partitionpath.field": "created_date_brt",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.combine.before.insert": "true",
  "hoodie.combine.before.upsert": "false",
  "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.parquet.small.file.limit": 996147200,
  "hoodie.parquet.max.file.size": 1073741824,
  "hoodie.parquet.block.size": 1073741824,
  "hoodie.copyonwrite.record.size.estimate": 512,
  "hoodie.cleaner.commits.retained": 10,
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.database": "datalake_raw",
  "hoodie.datasource.hive_sync.table": "stockout",
  "hoodie.datasource.hive_sync.partition_fields": "created_date_brt",
  "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://ip-10-0-21-127.us-west-2.compute.internal:10000",
  "hoodie.insert.shuffle.parallelism": 1500,
  "hoodie.bulkinsert.shuffle.parallelism": 700,
  "hoodie.upsert.shuffle.parallelism": 1500
}
```
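For reference, here is a minimal sketch of how these options get applied in the write (assuming a PySpark job; `df` and the target S3 path are placeholders):

```python
# Minimal sketch of the bulk_insert write, assuming PySpark.
# `hudi_options` is the dict shown above; `df` and the path are placeholders.
hudi_options = {
    "hoodie.table.name": "stockout",
    "hoodie.datasource.write.operation": "bulk_insert",
    # ... remaining options from the JSON above ...
}

(df.write
   .format("org.apache.hudi")   # long-form name, as used with Hudi 0.5.x
   .options(**hudi_options)
   .mode("overwrite")           # initial bulk load
   .save("s3://my-bucket/datalake_raw/stockout"))  # placeholder path
```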
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 15 10"
src="https://user-images.githubusercontent.com/36298331/90816019-f8640b80-e301-11ea-8334-c64bd3e0278c.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 38"
src="https://user-images.githubusercontent.com/36298331/90816029-fc902900-e301-11ea-9515-6f407d05968e.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 10"
src="https://user-images.githubusercontent.com/36298331/90816031-fdc15600-e301-11ea-9830-47c2c91ee983.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 13 46"
src="https://user-images.githubusercontent.com/36298331/90816034-fe59ec80-e301-11ea-96e8-0b22de34e233.png">
I also tried a bulk_insert parallelism of 4000, but that didn't work either. I really don't
know what to do...
Thank you.