andreagarcia20 opened a new issue, #8302:
URL: https://github.com/apache/hudi/issues/8302
**Problem Description**
Hi team,
We are trying to build a Hudi application that runs daily (as a batch job)
and incrementally updates data as new information arrives. This is our first
time working with Hudi, and some issues have come up while experimenting during
deployment. Currently, our input data is 11 GB split across 627 parquet
files, and we are testing two scenarios:
- 100% of inserts
- 100% of updates
Every time we run a complete execution (inserts or updates) we get
"Executor Process Lost" errors until all the executors die and the job
fails. The failure always occurs in the "Building workload profile" stage (more
info below).
We are using AWS EMR with 1 master node (m5.2xlarge) and 4 instances
(r4.8xlarge).
**Hudi Options**
```python
hudiOptions = {
    "hoodie.table.name": "F5",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "hour,cups",
    "hoodie.datasource.write.partitionpath.field": "subsystem,year,month,day",
    "hoodie.datasource.write.precombine.field": "dedup",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.drop.partition.columns": "true",
    "hoodie.compact.inline": "true",
    "hoodie.datasource.compaction.async.enable": "false",
    "hoodie.compact.inline.max.delta.commits": 1,
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": 1,
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    "hoodie.upsert.shuffle.parallelism": 500,
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.enable.data.skipping": "true",
}
```
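One thing worth noting: Hudi silently ignores option keys it does not recognize, so a typo in a key name (for example an extra letter in the `hoodie.` prefix) disables the intended setting without any error. A minimal sanity check in plain Python (no Spark required; `find_suspect_keys` and the sample dict are illustrative, not Hudi API) can catch such typos before submitting the job:

```python
def find_suspect_keys(options):
    """Return option keys that lack the expected 'hoodie.' prefix.

    Hudi ignores unknown keys silently, so a misspelled key means the
    corresponding feature is simply never enabled.
    """
    return [k for k in options if not k.startswith("hoodie.")]


# Example: a dict containing one misspelled key ("hooddie" with an extra 'd').
sample_options = {
    "hoodie.table.name": "F5",
    "hooddie.compact.inline": "true",  # typo: extra 'd' in the prefix
    "hoodie.compact.inline.max.delta.commits": 1,
}

print(find_suspect_keys(sample_options))  # → ['hooddie.compact.inline']
```

Running a check like this against the options dict before each job submission is a cheap way to rule out configuration typos as the cause of unexpected behavior.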
To write the data we always use the upsert operation:
`.option('hoodie.datasource.write.operation', 'upsert')`
**Spark-submit**
```
--jars /usr/lib/hudi/hudi-spark-bundle.jar
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
--conf spark.emr.maximizeResourceAllocation=true
```
**Environment Description**
* Hudi version : 0.12.2-amzn-0
* Spark version: 3.3.1
* Hive version : 3.1.3
* Hadoop version : Amazon 3.3.3
* Storage : S3
* Running on Docker? (yes/no) : No
**Additional context**
We have experimented with both Spark and Hudi configurations (memory, number
of executors/cores, parallelism, ...) but nothing has worked for us. Any
suggestion or improvement to our configuration is welcome.
**SparkUI information**
<img width="929" alt="image"
src="https://user-images.githubusercontent.com/129081554/227980692-aa9830fe-a28b-4f8d-9dc7-5ae094772838.png">
<img width="944" alt="image"
src="https://user-images.githubusercontent.com/129081554/227980959-af418b5e-4e75-4c14-bbfe-7451d37ef94e.png">
<img width="931" alt="image"
src="https://user-images.githubusercontent.com/129081554/227981100-a6d9053a-c687-497b-83e3-79db87eeb266.png">
<img width="929" alt="image"
src="https://user-images.githubusercontent.com/129081554/227981324-b95fc73c-6eb9-4333-97f4-065a5a072987.png">
<img width="935" alt="image"
src="https://user-images.githubusercontent.com/129081554/227981497-1390b612-d5ad-48e6-8d60-5540e6ddfa60.png">