noahtaite opened a new issue, #10172:
URL: https://github.com/apache/hudi/issues/10172

   **Describe the problem you faced**
   
   We generated a medium-sized MoR table using bulk_insert with the following 
dimensions:
   - 3000 partitions 
   - xx TB data
   - xx S3 objects
   
   Because bulk_insert does not handle file sizing automatically, the table 
has many small files, so we need to run clustering to improve downstream read 
performance. 
   
   After running clustering and counting the data, the record count grew from 
177,822,668 to 177,828,417 (a difference of 5,749 records). When I run an 
**except()** between the clustered dataset and the control dataset, it returns 
3,127,201 records.
   
   I am trying to understand why the count changes after running clustering, 
and why about 3.1M records are reported as different even though I have left 
the following configuration at its default value:
   ```
   hoodie.clustering.preserve.commit.metadata
   When rewriting data, preserves existing hoodie_commit_time
   Default Value: true (Optional)
   Config Param: PRESERVE_COMMIT_METADATA
   Since Version: 0.9.0
   ```
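
One way to narrow down the **except()** result is to repeat the comparison with the Hudi metadata columns excluded. The sketch below uses plain Python dictionaries as a stand-in for DataFrame rows. It rests on an assumption, not a confirmed diagnosis: clustering rewrites records into new files, so `_hoodie_file_name` would be expected to differ even when `_hoodie_commit_time` is preserved. The sample rows and helper names are hypothetical.

```python
# Hudi metadata columns stored alongside every record.
HOODIE_META_COLUMNS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}


def row_signature(row, ignore_meta):
    """Return a hashable form of the row, optionally without metadata columns."""
    return tuple(sorted(
        (name, value)
        for name, value in row.items()
        if not (ignore_meta and name in HOODIE_META_COLUMNS)
    ))


def rows_only_in_left(left, right, ignore_meta):
    """Set-style difference: distinct rows of `left` that have no match in `right`."""
    right_signatures = {row_signature(r, ignore_meta) for r in right}
    seen = set()
    result = []
    for row in left:
        signature = row_signature(row, ignore_meta)
        if signature not in right_signatures and signature not in seen:
            seen.add(signature)
            result.append(row)
    return result


# Hypothetical sample: the same record before and after clustering.
# Only the file name differs (assumption: clustering wrote a new file).
control = [{
    "_hoodie_commit_time": "20231101000000",
    "_hoodie_file_name": "file-a_0.parquet",
    "id": 1,
    "value": "x",
}]
clustered = [{
    "_hoodie_commit_time": "20231101000000",
    "_hoodie_file_name": "file-b_0.parquet",
    "id": 1,
    "value": "x",
}]

diff_with_meta = rows_only_in_left(clustered, control, ignore_meta=False)
diff_without_meta = rows_only_in_left(clustered, control, ignore_meta=True)
print(len(diff_with_meta), len(diff_without_meta))
```

If the difference shrinks sharply once the metadata columns are dropped, most of the mismatches are probably metadata-only and the remaining rows are the ones worth inspecting.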
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Generate a MoR table and update it many times without running compaction.
   2. Run a snapshot query against the table to get the count.
   3. Run clustering on the table.
   4. Run another snapshot query.
   5. Observe that the count has changed.
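
A possible follow-up to step 5 is to check whether any record key appears more than once per partition in the post-clustering snapshot. The sketch below is plain Python over hypothetical sample rows, not Hudi API code. Whether duplicates actually account for the extra records here is an untested assumption.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return {(partition_path, record_key): count} for keys seen more than once."""
    counts = Counter(
        (row["_hoodie_partition_path"], row["_hoodie_record_key"])
        for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical snapshot rows read after clustering.
snapshot_after = [
    {"_hoodie_partition_path": "p=1", "_hoodie_record_key": "id:1"},
    {"_hoodie_partition_path": "p=1", "_hoodie_record_key": "id:2"},
    {"_hoodie_partition_path": "p=1", "_hoodie_record_key": "id:2"},
    {"_hoodie_partition_path": "p=2", "_hoodie_record_key": "id:2"},
]

dupes = duplicated_keys(snapshot_after)
# Each duplicated key contributes (count - 1) rows beyond the expected one.
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```

Comparing `extra_rows` with the observed count difference would show whether duplicated keys can explain the growth.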
   
   **Expected behavior**
   
   I expected the count to remain the same. I did see new files created from 
the clustering process.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   - We noticed today that compaction has never been run on this table. 
Could that have any impact on clustering?
   
   
   **Stacktrace**
   
   N/A
   

