sleapfish opened a new issue #2522: URL: https://github.com/apache/hudi/issues/2522
**Problem**

When the source data set contains unchanged rows, Hudi upserts the matching target table rows and includes those records in the new commit.

Consider CDC/incremental logic where the incoming batch may contain records identical to a previous insert, new records, and changed records. Hudi upserts all of them (new, changed, and unchanged), and they all become part of the new commit. When you then query increments, the result includes many unnecessary (unchanged) rows as well.

I would like to avoid that. Is there a way to drop unchanged rows from the source?

**To Reproduce**

Steps to reproduce the behavior:

1. Fully load the Hudi table

Target example:
```
----------------------------------------
| row_key | att_1 | att_2   | commit |
----------------------------------------
| 1       | 1_1   | 1_2     | 0      |
----------------------------------------
| 2       | 2_1   | 2_2     | 0      |
----------------------------------------
```

2. Incrementally upsert a new data set (the incremental data set should include unchanged records)

Incremental data:
```
-------------------------------
| row_key | att_1 | att_2   |
-------------------------------
| 1       | 1_1   | 1_2     |
-------------------------------
| 2       | 2_1   | changed |
-------------------------------
| 3       | 3_1   | 3_2     |
-------------------------------
| 4       | 4_1   | 4_2     |
-------------------------------
```

3. Incrementally query the Hudi table for the latest commit

Target example:
```
----------------------------------------
| row_key | att_1 | att_2   | commit |
----------------------------------------
| 1       | 1_1   | 1_2     | 1      |
----------------------------------------
| 2       | 2_1   | changed | 1      |
----------------------------------------
| 3       | 3_1   | 3_2     | 1      |
----------------------------------------
| 4       | 4_1   | 4_2     | 1      |
----------------------------------------
```

**Expected behavior**

Target example:
```
----------------------------------------
| row_key | att_1 | att_2   | commit |
----------------------------------------
| 1       | 1_1   | 1_2     | 0      |
----------------------------------------
| 2       | 2_1   | changed | 1      |
----------------------------------------
| 3       | 3_1   | 3_2     | 1      |
----------------------------------------
| 4       | 4_1   | 4_2     | 1      |
----------------------------------------
```

**Environment Description**

* Hudi version : 0.5.3
* Spark version : 2.4.5
* Storage (HDFS/S3/GCS..) : S3

Thank you in advance!
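**Illustration of the desired filtering**

The expected behavior above amounts to comparing each incoming row with the current target row for the same `row_key` and keeping only rows that are new or differ in at least one attribute. The following is a minimal sketch of that comparison in plain Python, using the example data from this issue. It is not a Hudi API; `drop_unchanged` and `to_upsert` are hypothetical names, and the assumption is that the same idea could be applied in Spark before the write, for example as a left anti join of the incoming batch against the current snapshot on all data columns.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def drop_unchanged(incoming: Iterable[Row], target: Iterable[Row],
                   key: str = "row_key") -> List[Row]:
    """Hypothetical helper: keep only incoming rows that are new or changed."""
    # Index the current target snapshot by record key, ignoring the
    # commit metadata column so only data attributes are compared.
    current = {
        r[key]: {k: v for k, v in r.items() if k != "commit"}
        for r in target
    }
    # A row is kept when its key is unknown (new record) or when any
    # attribute differs from the stored row (changed record).
    return [row for row in incoming if current.get(row[key]) != row]


# Target table after the full load (step 1).
target = [
    {"row_key": "1", "att_1": "1_1", "att_2": "1_2", "commit": "0"},
    {"row_key": "2", "att_1": "2_1", "att_2": "2_2", "commit": "0"},
]

# Incremental data set (step 2), including one unchanged record.
incoming = [
    {"row_key": "1", "att_1": "1_1", "att_2": "1_2"},
    {"row_key": "2", "att_1": "2_1", "att_2": "changed"},
    {"row_key": "3", "att_1": "3_1", "att_2": "3_2"},
    {"row_key": "4", "att_1": "4_1", "att_2": "4_2"},
]

to_upsert = drop_unchanged(incoming, target)
print([r["row_key"] for r in to_upsert])  # ['2', '3', '4']
```

Upserting only `to_upsert` would leave row 1 at commit 0 and put rows 2, 3, and 4 in commit 1, which matches the expected table above. The cost is an extra read of the target snapshot before each write.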
