rubenssoto opened a new issue #2003:
URL: https://github.com/apache/hudi/issues/2003


   Hi guys,
   
   I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors.
   
   Data size: 350 GB
   Spark master: 4 vCPUs, 16 GB RAM
   Core nodes: 8 × r5.4xlarge (16 vCPUs, 122 GB RAM each)
   
   **My spark-submit:**
   
```
spark-submit --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 \
    -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memory=33g \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
```
   
   
   My Hudi options:
   
```json
{
   "hoodie.datasource.write.recordkey.field": "id",
   "hoodie.table.name": "stockout",
   "hoodie.datasource.write.table.name": "stockout",
   "hoodie.datasource.write.operation": "bulk_insert",
   "hoodie.datasource.write.partitionpath.field": "created_date_brt",
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.combine.before.insert": "true",
   "hoodie.combine.before.upsert": "false",
   "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
   "hoodie.parquet.small.file.limit": 996147200,
   "hoodie.parquet.max.file.size": 1073741824,
   "hoodie.parquet.block.size": 1073741824,
   "hoodie.copyonwrite.record.size.estimate": 512,
   "hoodie.cleaner.commits.retained": 10,
   "hoodie.datasource.hive_sync.enable": "true",
   "hoodie.datasource.hive_sync.database": "datalake_raw",
   "hoodie.datasource.hive_sync.table": "stockout",
   "hoodie.datasource.hive_sync.partition_fields": "created_date_brt",
   "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://ip-10-0-21-127.us-west-2.compute.internal:10000",
   "hoodie.insert.shuffle.parallelism": 1500,
   "hoodie.bulkinsert.shuffle.parallelism": 700,
   "hoodie.upsert.shuffle.parallelism": 1500
}
```
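
For reference, options like these are passed straight to the DataFrame writer. A minimal PySpark sketch, assuming the write happens via `df.write.format("hudi")` (`df`, `base_path`, and the `write_bulk_insert` helper are illustrative, not from the issue; only the core write options are repeated here):

```python
# Sketch of wiring the options above into a Hudi write.
# The dict mirrors a subset of the issue's settings; df and base_path
# are placeholders for the caller's DataFrame and target S3/HDFS path.
hudi_options = {
    "hoodie.table.name": "stockout",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.partitionpath.field": "created_date_brt",
    "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
    "hoodie.bulkinsert.shuffle.parallelism": 700,
}

def write_bulk_insert(df, base_path, options):
    """Append df to the Hudi table at base_path with the given options."""
    (df.write.format("hudi")
       .options(**{k: str(v) for k, v in options.items()})  # Hudi expects strings
       .mode("append")
       .save(base_path))
```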
   
   <img width="1680" alt="Screenshot 2020-08-20 at 16 15 10" src="https://user-images.githubusercontent.com/36298331/90816019-f8640b80-e301-11ea-8334-c64bd3e0278c.png">
   <img width="1680" alt="Screenshot 2020-08-20 at 16 14 38" src="https://user-images.githubusercontent.com/36298331/90816029-fc902900-e301-11ea-9515-6f407d05968e.png">
   <img width="1680" alt="Screenshot 2020-08-20 at 16 14 10" src="https://user-images.githubusercontent.com/36298331/90816031-fdc15600-e301-11ea-9830-47c2c91ee983.png">
   <img width="1680" alt="Screenshot 2020-08-20 at 16 13 46" src="https://user-images.githubusercontent.com/36298331/90816034-fe59ec80-e301-11ea-96e8-0b22de34e233.png">
   
   
   I tried using a bulk_insert parallelism of 4000, but it didn't work. I really don't know what to do...
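
As a sizing sketch only (a rule of thumb, not an official Hudi formula): for bulk_insert, each shuffle task writes roughly one file, so a reasonable starting parallelism is input size divided by the target parquet file size:

```python
import math

def estimate_bulk_insert_parallelism(input_size_gb, target_file_size_mb=1024):
    """Rough starting point for hoodie.bulkinsert.shuffle.parallelism:
    one shuffle task per target-sized output file (an assumption, not
    an official Hudi formula)."""
    input_size_mb = input_size_gb * 1024
    return max(1, math.ceil(input_size_mb / target_file_size_mb))

# 350 GB dataset with 1 GB target files (hoodie.parquet.max.file.size above):
print(estimate_bulk_insert_parallelism(350))       # -> 350
# With a 512 MB target instead, the estimate matches the configured 700:
print(estimate_bulk_insert_parallelism(350, 512))  # -> 700
```

By this heuristic, 4000 tasks for 350 GB would produce many small (~90 MB) files, which is likely counterproductive for a table tuned for 1 GB files.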
   
   Thank you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to