andreagarcia20 opened a new issue, #8302:
URL: https://github.com/apache/hudi/issues/8302

   **Problem Description**
   
   Hi team,
We are trying to build a Hudi application that runs daily (as a batch job) 
to incrementally update data as new information arrives. This is our first 
time working with Hudi, and some issues have appeared during our deployment 
experiments. Currently, we have 11 GB of input data split across 627 parquet 
files, and we are testing two scenarios: 
   
   - 100% of inserts
   - 100% of updates
   
   Every time we try to run a complete execution (inserts or updates), executors 
are lost one by one ("Executor Process Lost") until all of them die and the job 
fails. The failure always occurs in the "Building Workload profile" stage (more 
info below)
   
   We are using AWS EMR with 1 master node (m5.2xlarge) and 4 core nodes 
(r4.8xlarge)
   
   **Hudi Options**
   
   ```python
   hudiOptions = {
       "hoodie.table.name": "F5",
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.recordkey.field": "hour,cups",
       "hoodie.datasource.write.partitionpath.field": "subsystem,year,month,day",
       "hoodie.datasource.write.precombine.field": "dedup",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.datasource.write.drop.partition.columns": "true",
       "hoodie.compact.inline": "true",
       "hoodie.datasource.compaction.async.enable": "false",
       "hoodie.compact.inline.max.delta.commits": 1,
       "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
       "hoodie.cleaner.fileversions.retained": 1,
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.index.type": "BLOOM",
       "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
       "hoodie.upsert.shuffle.parallelism": 500,
       "hoodie.metadata.enable": "true",
       "hoodie.metadata.index.column.stats.enable": "true",
       "hoodie.enable.data.skipping": "true"
   }
   ```
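   For context on our composite keys: with ComplexKeyGenerator the record key is 
built from both configured fields, roughly like this (a simplified illustration 
with made-up sample values, not Hudi's actual implementation):

   ```python
   # Simplified illustration (not Hudi's actual code) of how a composite
   # record key is assembled from the configured key fields.
   def complex_record_key(record, key_fields):
       # ComplexKeyGenerator-style "field:value" pairs joined by commas.
       return ",".join(f"{field}:{record[field]}" for field in key_fields)

   # Sample row with made-up values for our two key fields.
   row = {"hour": "2023-03-27-10", "cups": "ES0021XYZ"}
   key = complex_record_key(row, ["hour", "cups"])
   print(key)  # hour:2023-03-27-10,cups:ES0021XYZ
   ```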
   
   To write the data we always use upsert mode: 
`.option('hoodie.datasource.write.operation', 'upsert')`
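   Concretely, our daily write step looks roughly like this (a sketch: the 
function name, table path, and abridged options dict are illustrative, and a 
live SparkSession is required to actually run it):

   ```python
   # Abridged subset of the full options shown above, for illustration.
   ABRIDGED_OPTIONS = {
       "hoodie.table.name": "F5",
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.recordkey.field": "hour,cups",
       "hoodie.datasource.write.partitionpath.field": "subsystem,year,month,day",
       "hoodie.datasource.write.precombine.field": "dedup",
   }

   def upsert_batch(df, path, options=ABRIDGED_OPTIONS):
       """Upsert one daily batch (a Spark DataFrame) into the Hudi table."""
       (df.write.format("hudi")
           .options(**options)
           .option("hoodie.datasource.write.operation", "upsert")
           .mode("append")
           .save(path))

   # Example call (illustrative path, not our real bucket):
   # upsert_batch(daily_df, "s3://my-bucket/hudi/F5")
   ```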
   
   **Spark-submit**
   
   ```shell
   spark-submit \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     --conf spark.emr.maximizeResourceAllocation=true
   ```
   
   **Environment Description**
   
   * Hudi version : 0.12.2-amzn-0
   * Spark version:  3.3.1
   * Hive version :  3.1.3
   * Hadoop version :  Amazon 3.3.3
   * Storage : S3
   * Running on Docker? (yes/no) : No
   
   **Additional context**
   We have been experimenting with both Spark and Hudi configurations (memory, 
number of executors/cores, parallelism...) but nothing has worked for us so far. 
Any suggestion or improvement to our configuration is welcome.
   
   **SparkUI information**
   <img width="929" alt="image" src="https://user-images.githubusercontent.com/129081554/227980692-aa9830fe-a28b-4f8d-9dc7-5ae094772838.png">
   
   <img width="944" alt="image" src="https://user-images.githubusercontent.com/129081554/227980959-af418b5e-4e75-4c14-bbfe-7451d37ef94e.png">
   
   <img width="931" alt="image" src="https://user-images.githubusercontent.com/129081554/227981100-a6d9053a-c687-497b-83e3-79db87eeb266.png">
   
   <img width="929" alt="image" src="https://user-images.githubusercontent.com/129081554/227981324-b95fc73c-6eb9-4333-97f4-065a5a072987.png">
   
   <img width="935" alt="image" src="https://user-images.githubusercontent.com/129081554/227981497-1390b612-d5ad-48e6-8d60-5540e6ddfa60.png">
   