rubenssoto opened a new issue #2003:
URL: https://github.com/apache/hudi/issues/2003
Hi guys,
I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors.
Data size: 350 GB
Spark master: 4 CPUs, 16 GB RAM
Core nodes: 8 × r5.4xlarge = 16 vCPUs, 122 GB RAM each
**My spark-submit:**
`spark-submit --deploy-mode cluster --conf "spark.executor.extraJavaOptions
-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --conf spark.executor.cores=5
--conf spark.executor.memory=33g --conf spark.executor.memoryOverhead=2048
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.sql.hive.convertMetastoreParquet=false --packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
`
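As a sanity check on the executor sizing, here is a rough back-of-the-envelope calculation (assuming YARN can hand roughly the full 122 GB per node to containers):

```python
# Back-of-the-envelope executor sizing check (assumed values, not measured).
node_vcpus = 16     # r5.4xlarge
node_mem_gb = 122   # approximate memory available per node

executor_cores = 5  # spark.executor.cores=5
executor_mem_gb = 33  # spark.executor.memory=33g
overhead_gb = 2     # spark.executor.memoryOverhead=2048 (MB)

executors_per_node = node_vcpus // executor_cores                     # -> 3
mem_used_per_node = executors_per_node * (executor_mem_gb + overhead_gb)  # -> 105

print(executors_per_node, mem_used_per_node)  # 3 executors, 105 GB of 122 GB
```

So the configuration should fit: 3 executors per node using about 105 GB of the 122 GB available.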
My Hudi options:
```json
{
  "hoodie.datasource.write.recordkey.field": "id",
  "hoodie.table.name": "stockout",
  "hoodie.datasource.write.table.name": "stockout",
  "hoodie.datasource.write.operation": "bulk_insert",
  "hoodie.datasource.write.partitionpath.field": "created_date_brt",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.combine.before.insert": "true",
  "hoodie.combine.before.upsert": "false",
  "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.parquet.small.file.limit": 996147200,
  "hoodie.parquet.max.file.size": 1073741824,
  "hoodie.parquet.block.size": 1073741824,
  "hoodie.copyonwrite.record.size.estimate": 512,
  "hoodie.cleaner.commits.retained": 10,
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.database": "datalake_raw",
  "hoodie.datasource.hive_sync.table": "stockout",
  "hoodie.datasource.hive_sync.partition_fields": "created_date_brt",
  "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://ip-10-0-21-127.us-west-2.compute.internal:10000",
  "hoodie.insert.shuffle.parallelism": 1500,
  "hoodie.bulkinsert.shuffle.parallelism": 700,
  "hoodie.upsert.shuffle.parallelism": 1500
}
```
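For reference, here is a minimal sketch of how these options get applied in the write (assuming a PySpark job; `df` and the target S3 path are placeholders):

```python
# Minimal sketch of the bulk_insert write, assuming PySpark.
# `hudi_options` is the dict shown above; `df` and the path are placeholders.
hudi_options = {
    "hoodie.table.name": "stockout",
    "hoodie.datasource.write.operation": "bulk_insert",
    # ... remaining options from the JSON above ...
}

(df.write
   .format("org.apache.hudi")   # long-form name, as used with Hudi 0.5.x
   .options(**hudi_options)
   .mode("overwrite")           # initial bulk load
   .save("s3://my-bucket/datalake_raw/stockout"))  # placeholder path
```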
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 15 10"
src="https://user-images.githubusercontent.com/36298331/90816019-f8640b80-e301-11ea-8334-c64bd3e0278c.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 38"
src="https://user-images.githubusercontent.com/36298331/90816029-fc902900-e301-11ea-9515-6f407d05968e.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 10"
src="https://user-images.githubusercontent.com/36298331/90816031-fdc15600-e301-11ea-9830-47c2c91ee983.png">
<img width="1680" alt="Captura de Tela 2020-08-20 às 16 13 46"
src="https://user-images.githubusercontent.com/36298331/90816034-fe59ec80-e301-11ea-96e8-0b22de34e233.png">
I also tried a bulk_insert parallelism of 4000, but that didn't work either. I really don't
know what to do...
Thank you.