szingerpeter opened a new issue, #9860: URL: https://github.com/apache/hudi/issues/9860
**Describe the problem you faced**

Upserts are taking a really long time, and I just noticed that Hudi writes far more records during a commit than the number of inserts/updates:

```
commit.totalRecordsWritten: 48,528,843
commit.totalUpdateRecordsWritten: 437,833
commit.totalInsertRecordsWritten: 212,220
```

Disclaimer: I tried looking at the Tuning and Performance guides on the Hudi website, but I would need a bit more help if possible.

**To Reproduce**

The use case is ingesting transactional data related to multiple concepts from source systems using Spark on a daily basis, where we use Hudi to take care of upserts.

- In most cases we have a transaction table and an item table, which represents the items assigned to those transactions.
- Most of the time transactions have a `created_datetime`; items only occasionally have such a field.
- Most of the time the PK is an integer, but it is not always auto-incrementing.
- The precombine field is either an incrementing integer based on when the data was received (e.g. today it would have a value of `20231013`) or a timestamp.
- Most updates arrive within a couple of days or weeks, but updates can happen even after years (a really rare scenario, but a possibility).

These are currently our write settings for all tables:

```
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
'hoodie.datasource.write.partitionpath.field': ''
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
'hoodie.parquet.max.file.size': '1500000000'
'hoodie.parquet.small.file.limit': '1000000000'
'hoodie.datasource.hive_sync.enable': True
'hoodie.datasource.hive_sync.support_timestamp': True
'hoodie.embed.timeline.server': False
```

We have several scenarios, but the behaviour above applies to pretty much all of them.
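Not part of the original report, but to make the write-amplification suspicion concrete: on COPY_ON_WRITE, an upsert rewrites every Parquet file that contains at least one updated record, and `totalRecordsWritten` counts all records in those rewritten files, not just the changed ones. With `hoodie.parquet.small.file.limit` near 1 GB, almost every existing file is a "small file" candidate, so new inserts get bin-packed into large existing files and those files are rewritten wholesale. A minimal sketch of adjusted options; the sizes here are hypothetical starting points to tune, not official recommendations:

```python
# Sketch: write options that reduce how many untouched records get
# rewritten per commit on a COPY_ON_WRITE table.
# The byte sizes below are illustrative assumptions, not Hudi defaults
# or documented recommendations; benchmark against your own workload.
hudi_write_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    # Smaller file targets mean each upsert touches (and rewrites)
    # fewer unrelated records:
    'hoodie.parquet.max.file.size': str(128 * 1024 * 1024),    # ~128 MB
    # Only genuinely small files should attract new inserts:
    'hoodie.parquet.small.file.limit': str(100 * 1024 * 1024),  # ~100 MB
}
```

With a ~1 GB small-file limit and ~1.5 GB max file size, nearly every file in the table sits in the "small" band forever, which would explain commits writing 48M records to apply ~650K changes.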
I believe this could be improved via correct partitioning and indexing of the data; however, I'm a bit uncertain about what partitioning/index to use for some scenarios.

When we have a consistent `created_datetime` for the PK:

- I guess it makes the most sense to use `created_datetime` as the partitioning field (if there is a better way, please let me know).
- What would be the best indexing strategy here when the PK is auto-increment vs. random ints?
- I couldn't really find whether dynamic partitioning is a possibility in Hudi, e.g. records older than 2 years are partitioned by year; records older than 1 year by year and month; records newer than 1 year by year, month, and day.

Also, do you have recommendations for when we don't have a consistent `created_datetime` for the PK, and nothing in the table seems like a good candidate for partitioning/indexing? This could apply to both transactions and items as described above.

- In some of these cases the PK is auto-increment.
- In some other cases the PK is quite random (after seeing a value around 2 billion, the next value is around 400 million).

Also, is there built-in partitioning logic in Hudi where the partitions are calculated from a hash of the PK? Alternatively, we could implement logic where a given set of PKs always lands in the same partition, but that wouldn't automatically scale (do you have recommendations for that?)

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.11.0-amzn-0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : NO
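On the hash-of-PK question: one way to roll this by hand (when nothing in Hudi's built-in key generators fits) is to derive a bucket column from the PK before writing and point `hoodie.datasource.write.partitionpath.field` at it. A minimal sketch; the function name, bucket count, and partition-path format are my own assumptions, not a Hudi API:

```python
import hashlib

def pk_partition(pk: int, num_buckets: int = 64) -> str:
    """Derive a stable partition path from an arbitrary integer PK.

    Hashing first (rather than a plain pk % num_buckets) keeps buckets
    balanced whether keys are auto-increment or scattered across wide
    ranges (e.g. a 2-billion key followed by a 400-million key).
    """
    digest = hashlib.md5(str(pk).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"bucket={bucket:02d}"
```

In Spark you would materialize this as a column (e.g. via a UDF) and use it as the partition-path field. Two caveats: `num_buckets` is effectively fixed once the table is written, since changing it reassigns every key to a different partition; and if I recall correctly, Hudi 0.11+ also ships a built-in bucket index (`hoodie.index.type=BUCKET`) that may cover this use case without a custom partition column, so it is worth checking before hand-rolling anything.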
