noahtaite opened a new issue, #10172: URL: https://github.com/apache/hudi/issues/10172
**Describe the problem you faced**

We generated a medium-sized MoR table using bulk_insert with the following dimensions:

- 3000 partitions
- xx TB data
- xx S3 objects

Since bulk_insert does not automatically handle file sizing, we have many small files and need to run clustering on the table to improve downstream read performance. After running clustering and counting the data, the count has grown from 177,822,668 to 177,828,417 (a difference of ~6k records). When I run an **except()** between the clustered and control datasets, it returns 3,127,201 records. I am trying to understand why the count changed after clustering, and why there are ~3M supposedly different records, even though I have not changed the following default configuration:

```
hoodie.clustering.preserve.commit.metadata
When rewriting data, preserves existing hoodie_commit_time
Default Value: true (Optional)
Config Param: PRESERVE_COMMIT_METADATA
Since Version: 0.9.0
```

**To Reproduce**

Steps to reproduce the behavior:

1. Generate a MoR table and update it many times without running compaction.
2. Run a snapshot query against the table to get the count.
3. Run clustering on the table.
4. Run another snapshot query.
5. Observe that the count has changed.

**Expected behavior**

I expected the count to remain the same. I did see new files created by the clustering process.

**Environment Description**

* Hudi version : 0.12.1-amzn-0
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

We noticed today that compaction has never been run on this table. Could that have any impact on clustering?

**Stacktrace**

N/A
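A side note on the 3.1M-row `except()` output (a hypothesis, not confirmed from the issue itself): Spark's `except()` compares every column of both datasets. If the Hudi metadata columns are included in the comparison, every record rewritten by clustering can show up as "different" even when its payload is unchanged, because clustering writes new files and therefore new `_hoodie_file_name` (and typically `_hoodie_commit_seqno`) values; `hoodie.clustering.preserve.commit.metadata` only preserves `hoodie_commit_time`. A minimal, self-contained Python sketch of the effect, using plain sets of tuples to stand in for the DataFrames:

```python
# Rows modeled as (record_key, payload, file_name) tuples. The file_name
# column stands in for Hudi's _hoodie_file_name metadata column, which
# clustering rewrites even when the payload itself is untouched.

before = {
    ("key1", "payload-a", "file-001.parquet"),
    ("key2", "payload-b", "file-001.parquet"),
    ("key3", "payload-c", "file-002.parquet"),
}

# After clustering: identical payloads, but records live in a new file.
after = {
    ("key1", "payload-a", "file-100.parquet"),
    ("key2", "payload-b", "file-100.parquet"),
    ("key3", "payload-c", "file-100.parquet"),
}

# Diff over all columns (what except() does): every row is flagged,
# even though no payload changed.
full_diff = after - before
print(len(full_diff))  # 3

# Diff after dropping the metadata column: nothing differs.
drop_meta = lambda rows: {(key, payload) for key, payload, _ in rows}
payload_diff = drop_meta(after) - drop_meta(before)
print(len(payload_diff))  # 0
```

If this is what is happening here, re-running the comparison after dropping the `_hoodie_*` columns on both sides (e.g. with `df.drop(...)` in Spark) should shrink the 3,127,201 figure down to the genuine data difference, which would make the remaining ~6k count discrepancy easier to isolate.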
