nochimow opened a new issue, #6811:
URL: https://github.com/apache/hudi/issues/6811

   **Describe the problem you faced**
   
   After one month of data ingestion, our pipeline started to take a very long time during the upsert operation.
   We currently use Hudi 0.12 with Spark on AWS Glue 3.0 (Spark 3.1), writing the Hudi files in CoW format into an S3 bucket.
   
   **Environment Details**
   The Glue job currently has 75 workers of the G.2X type (each worker has 8 vCPUs, 32 GB of memory, and 128 GB of disk).
   The table currently holds 9.4 TB in S3, and all data is partitioned by year, month, and day; the earliest record is from 2020-11.
   Our typical input batch is a set of .AVRO files totalling about 3.5 GB on average, but split into roughly 1200 files averaging 3 MB each (the smallest being 1.72 MB and the largest 8 MB).
   Our job basically loads all the files into a Spark dataframe and writes them to the Hudi S3 location.
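   To illustrate how skewed the input is, here is a rough back-of-the-envelope calculation of a repartition count before the write (the ~120 MB target partition size is an assumption for illustration, not something we have tuned):

   ```python
   # ~1200 input files of ~3 MB each, ~3.5 GB total per batch
   total_input_mb = 3.5 * 1024          # ~3584 MB per batch
   target_partition_mb = 120            # assumed target size per Spark partition
   num_partitions = max(1, round(total_input_mb / target_partition_mb))
   print(num_partitions)                # ~30 partitions instead of ~1200 tiny files

   # In the Glue job this would be applied before the Hudi write, e.g.:
   # df = df.coalesce(num_partitions)
   ```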
   
   Our Hudi config is set like this:

   ```python
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
   "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.write.concurrency.mode": "single_writer",
   "hoodie.cleaner.commits.retained": 1,
   "hoodie.fail.on.timeline.archiving": False,
   "hoodie.keep.max.commits": 3,
   "hoodie.keep.min.commits": 2,
   "hoodie.bloom.index.use.caching": True,
   "hoodie.parquet.compression.codec": "snappy",
   "hoodie.index.type": "BLOOM",
   "hoodie.metrics.on": True,
   "hoodie.metrics.reporter.type": "CLOUDWATCH"
   ```
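   For context, these options are passed to the Hudi datasource roughly like this (a minimal sketch; the table name, record key, partition fields, and S3 path below are placeholders, not our real values):

   ```python
   # Sketch of the Glue job's write step; placeholder identifiers are marked.
   hudi_options = {
       "hoodie.table.name": "events",                        # placeholder
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.recordkey.field": "id",      # placeholder
       "hoodie.datasource.write.partitionpath.field": "year,month,day",
       # ...plus the full set of options listed above, e.g.:
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.index.type": "BLOOM",
   }

   # In the actual job (df is the dataframe built from the .AVRO input):
   # df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/my-table/")  # placeholder path
   ```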
   
   **To Reproduce**
   
   
   * Hudi version : 0.12

   * Spark version : 3.1.1

   * Storage (HDFS/S3/GCS..) : S3
   
   **Stacktrace**
   
   Looking at the Spark UI, the most expensive step is the SparkUpsertCommitActionExecutor, which accounts for nearly the entire 4.5 h of duration.
   
![sparkuicapture](https://user-images.githubusercontent.com/83978996/192640710-fac79fd0-dd0d-489a-9e74-0460faeb62c4.PNG)
   
   Looking at this step in detail, we have the following stats:
   
![sparkuicapture2](https://user-images.githubusercontent.com/83978996/192640770-889212a6-b3c4-415f-8c9b-4265f0081063.PNG)
   
   It's also possible to see that we had some failed jobs in this step:
   
![sparkuicapture3](https://user-images.githubusercontent.com/83978996/192641145-a2100f7a-de0a-4102-89a2-7a7c3e3ca878.PNG)
   
   
   Do you have any suggestions for optimizing a job with these characteristics? At this level of performance, the job is becoming really expensive to run.
   
   If any other logs from this execution would help, just let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
