nochimow opened a new issue #4299:
URL: https://github.com/apache/hudi/issues/4299


   Hi there,
   I'm currently facing some performance gaps in one specific table after we 
load 3 years of data.
   Our typical scenario is the following:
   
   Ingestion of 57 Avro files (stored on S3) with typical sizes of 70-128 MB, 
giving 2.75 GB of input data. We create a Spark DataFrame loading all these 
files and write it into Hudi.
   This amounts to 186 million rows.
   The schema of the table is composed of 7 string columns and 2 BigInt columns, 
partitioned by 3 string columns (Day, Month, Year).
   We also know that this table may receive 5 to 10% updates (the data comes 
from a CDC engine).
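For context, the pipeline described above can be sketched roughly as follows (a minimal sketch; the bucket/prefix names and the helper functions are my assumptions, not the actual job code, and the Hudi options would be the ones listed further down in this issue):

```python
# Hypothetical sketch of the ingestion described above.
# Bucket, prefix, and function names are placeholders (assumptions).

def s3_avro_paths(bucket, prefix, files):
    # Build fully-qualified S3 URIs for the input Avro files.
    return [f"s3://{bucket}/{prefix}/{name}" for name in files]

def load_and_upsert(spark, paths, hudi_options, target_path):
    # Read all Avro files into one DataFrame and upsert it into the Hudi
    # table. Requires a live SparkSession with the spark-avro and Hudi
    # packages on the classpath; not executed here.
    df = spark.read.format("avro").load(paths)
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(target_path))
```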
   
   In the beginning, the same load took under 20 minutes, but after loading 
3 years of data the duration increased to 1.5 hours; after upgrading Hudi 
from 0.8 to 0.10, it decreased to 1.25 hours on average.
   
   I'm looking for any tuning tips that may help decrease the load duration 
even further, and keep it stable as the data grows over time.
   
   **Other useful information:**
   
   **Hoodie configs:**
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.payload.class": 
"org.apache.hudi.common.model.DefaultHoodieRecordPayload",
   "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.table.name": table_name,
   "hoodie.datasource.write.recordkey.field": IDX_COL,
   "hoodie.datasource.write.partitionpath.field": pks,
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.datasource.write.precombine.field": tiebreaker,
   "hoodie.datasource.write.operation": operation,
   "hoodie.write.concurrency.mode": "single_writer",
   "hoodie.cleaner.commits.retained": 1,
   "hoodie.fail.on.timeline.archiving": False,
   "hoodie.keep.max.commits": 3,
   "hoodie.keep.min.commits": 2,
   "hoodie.bloom.index.use.caching": True,
   "hoodie.parquet.compression.codec": "snappy"
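For reference, the options above assembled as a runnable Python dict, as they would be passed to the DataFrame writer (a sketch; `table_name`, `IDX_COL`, `pks`, `tiebreaker`, and `operation` are the job's variables, given placeholder values here since the real values are not shown in the issue):

```python
# Placeholder values for the variables referenced in the config above
# (assumptions; the real values are specific to the Glue job).
table_name = "my_table"
IDX_COL = "id"
pks = "Year,Month,Day"   # the three string partition columns mentioned above
tiebreaker = "cdc_seq"
operation = "upsert"

hudi_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": IDX_COL,
    "hoodie.datasource.write.partitionpath.field": pks,
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.precombine.field": tiebreaker,
    "hoodie.datasource.write.operation": operation,
    "hoodie.write.concurrency.mode": "single_writer",
    "hoodie.cleaner.commits.retained": 1,
    "hoodie.fail.on.timeline.archiving": False,
    "hoodie.keep.max.commits": 3,
    "hoodie.keep.min.commits": 2,
    "hoodie.bloom.index.use.caching": True,
    "hoodie.parquet.compression.codec": "snappy",
}

# Usage (needs a live SparkSession; not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(target_path)
```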
   
   **Environment Description**
   AWS Glue Job
   
   **Hudi version :**
   0.10
   
   **Spark version :**
   "Spark 2.4 - Python 3 with improved job times (Glue Version 2.0)"
   
   **Storage (HDFS/S3/GCS..) :**
   S3
   
   **Running on Docker? (yes/no) :**
   No
   
   **Additional context:**
   
   Infrastructure: Glue Job + S3

