szingerpeter opened a new issue, #9860: URL: https://github.com/apache/hudi/issues/9860
**Describe the problem you faced**

Upserts are taking a really long time, and I just noticed that Hudi writes far more records during a commit than the number of inserts/updates:

```
commit.totalRecordsWritten: 48,528,843
commit.totalUpdateRecordsWritten: 437,833
commit.totalInsertRecordsWritten: 212,220
```

Disclaimer: I tried looking at the Tuning and Performance guides on the Hudi website, but I would need a bit more help if possible.

**To Reproduce**

The use case is ingesting transactional data related to multiple concepts from source systems using Spark on a daily basis, where we use Hudi to take care of upserts.

- In most cases we have a transaction table and an item table, which represents the items assigned to those transactions.
- Most of the time transactions have a `created_datetime`; items only occasionally have such a field.
- Most of the time the PK is an integer, but it is not always auto-incrementing.
- The precombine field is either an incrementing integer based on when the data was received (e.g. today it would have a value of `20231013`) or a timestamp.
- Most updates arrive within a couple of days or weeks, but updates can happen even after years (a really rare scenario, but a possibility).

These are currently our write settings for all tables:

```
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
'hoodie.datasource.write.partitionpath.field': ''
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
'hoodie.parquet.max.file.size': '1500000000'
'hoodie.parquet.small.file.limit': '1000000000'
'hoodie.datasource.hive_sync.enable': True
'hoodie.datasource.hive_sync.support_timestamp': True
'hoodie.embed.timeline.server': False
```

We have several scenarios, but the behaviour above applies to pretty much all of them.
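Not part of the original report, but to make the write-amplification suspicion concrete: on COPY_ON_WRITE, an upsert rewrites every Parquet file that contains at least one updated record, and `totalRecordsWritten` counts all records in those rewritten files, not just the changed ones. With `hoodie.parquet.small.file.limit` near 1 GB, almost every existing file is a "small file" candidate, so new inserts get bin-packed into large existing files and those files are rewritten wholesale. A minimal sketch of adjusted options; the sizes here are hypothetical starting points to tune, not official recommendations:

```python
# Sketch: write options that reduce how many untouched records get
# rewritten per commit on a COPY_ON_WRITE table.
# The byte sizes below are illustrative assumptions, not Hudi defaults
# or documented recommendations; benchmark against your own workload.
hudi_write_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    # Smaller file targets mean each upsert touches (and rewrites)
    # fewer unrelated records:
    'hoodie.parquet.max.file.size': str(128 * 1024 * 1024),    # ~128 MB
    # Only genuinely small files should attract new inserts:
    'hoodie.parquet.small.file.limit': str(100 * 1024 * 1024),  # ~100 MB
}
```

With a ~1 GB small-file limit and ~1.5 GB max file size, nearly every file in the table sits in the "small" band forever, which would explain commits writing 48M records to apply ~650K changes.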
I believe this could be improved via correct partitioning and indexing of the data; however, I'm a bit uncertain about what partitioning/index to use for some scenarios.

When we have a consistent `created_datetime` for the PK:

- I guess it makes the most sense to use `created_datetime` as the partitioning field (if there is a better way, please let me know).
- What would be the best indexing strategy here when the PK is auto-increment vs. random ints?
- I couldn't really find whether dynamic partitioning is a possibility in Hudi, e.g. records older than 2 years are partitioned by year; records older than 1 year by year and month; records newer than 1 year by year, month, and day.

Also, do you have recommendations for when we don't have a consistent `created_datetime` for the PK, and nothing in the table seems like a good candidate for partitioning/indexing? This could apply to both transactions and items as described above.

- In some of these cases the PK is auto-increment.
- In some other cases the PK is quite random (after seeing a value around 2 billion, the next value is around 400 million).

Also, is there built-in partitioning logic in Hudi where the partitions are calculated from a hash of the PK? Alternatively, we could implement logic where a given set of PKs always lands in the same partition, but that wouldn't automatically scale (do you have recommendations for that?)

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.11.0-amzn-0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : NO
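On the hash-of-PK question: one way to roll this by hand (when nothing in Hudi's built-in key generators fits) is to derive a bucket column from the PK before writing and point `hoodie.datasource.write.partitionpath.field` at it. A minimal sketch; the function name, bucket count, and partition-path format are my own assumptions, not a Hudi API:

```python
import hashlib

def pk_partition(pk: int, num_buckets: int = 64) -> str:
    """Derive a stable partition path from an arbitrary integer PK.

    Hashing first (rather than a plain pk % num_buckets) keeps buckets
    balanced whether keys are auto-increment or scattered across wide
    ranges (e.g. a 2-billion key followed by a 400-million key).
    """
    digest = hashlib.md5(str(pk).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"bucket={bucket:02d}"
```

In Spark you would materialize this as a column (e.g. via a UDF) and use it as the partition-path field. Two caveats: `num_buckets` is effectively fixed once the table is written, since changing it reassigns every key to a different partition; and if I recall correctly, Hudi 0.11+ also ships a built-in bucket index (`hoodie.index.type=BUCKET`) that may cover this use case without a custom partition column, so it is worth checking before hand-rolling anything.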
