nochimow opened a new issue #4299: URL: https://github.com/apache/hudi/issues/4299
Hi there, I'm currently facing performance gaps in one specific table after loading 3 years of data.

Our typical scenario is the following: we ingest 57 Avro files (stored on S3) with typical sizes of 70-128 MB, giving 2.75 GB of input data, about 186 million rows. We create a Spark DataFrame from all these files and write it into Hudi. The table schema is composed of 7 string columns and 2 BigInt columns, partitioned by 3 string columns (Day, Month, Year). We also know that this table may receive 5 to 10% updates (the data comes from a CDC engine).

In the beginning, the same load took under 20 minutes, but after loading 3 years of data the duration increased to 1.5 hours; after we upgraded Hudi from 0.8 to 0.10, it decreased to 1.25 hours on average. I'm looking for any tuning tips that may help decrease the load duration further, and keep it stable as the data grows over time.

**Other useful information:**

**Hoodie configs:**

```
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.table.name": table_name,
"hoodie.datasource.write.recordkey.field": IDX_COL,
"hoodie.datasource.write.partitionpath.field": pks,
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.datasource.write.precombine.field": tiebreaker,
"hoodie.datasource.write.operation": operation,
"hoodie.write.concurrency.mode": "single_writer",
"hoodie.cleaner.commits.retained": 1,
"hoodie.fail.on.timeline.archiving": False,
"hoodie.keep.max.commits": 3,
"hoodie.keep.min.commits": 2,
"hoodie.bloom.index.use.caching": True,
"hoodie.parquet.compression.codec": "snappy"
```

**Environment Description**

* **Infrastructure:** AWS Glue Job + S3
* **Hudi version:** 0.10
* **Spark version:** "Spark 2.4 - Python 3 with improved job times (Glue Version 2.0)"
* **Storage (HDFS/S3/GCS..):** S3
* **Running on Docker? (yes/no):** No

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
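For reference, the configs listed above can be assembled into a plain options dict for the Glue/PySpark job. This is a minimal sketch: the values bound to `table_name`, `IDX_COL`, `pks`, `tiebreaker`, and `operation` are hypothetical placeholders (the issue does not state the concrete values), and the commented-out write call at the end only shows the usual shape of a Hudi DataFrame write.

```python
# Hypothetical placeholder values -- the real ones are not given in the issue.
table_name = "my_cdc_table"       # hoodie.table.name
IDX_COL = "record_id"             # record key column
pks = "year,month,day"            # partition path columns
tiebreaker = "cdc_timestamp"      # precombine field
operation = "upsert"              # 5-10% updates from CDC suggests upsert

# The Hudi write options exactly as listed in the issue.
hudi_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": IDX_COL,
    "hoodie.datasource.write.partitionpath.field": pks,
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.precombine.field": tiebreaker,
    "hoodie.datasource.write.operation": operation,
    "hoodie.write.concurrency.mode": "single_writer",
    "hoodie.cleaner.commits.retained": 1,
    "hoodie.fail.on.timeline.archiving": False,
    "hoodie.keep.max.commits": 3,
    "hoodie.keep.min.commits": 2,
    "hoodie.bloom.index.use.caching": True,
    "hoodie.parquet.compression.codec": "snappy",
}

# In the Glue job, the write itself would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(s3_target_path)
```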
