sleapfish opened a new issue #2522:
URL: https://github.com/apache/hudi/issues/2522


   **Problem**
   
   When the source data set contains unchanged rows, Hudi still upserts them 
into the target table and includes those records in the new commit. With 
CDC/incremental logic, a source batch may contain records identical to a 
previous insert, new records, and changed records. Hudi upserts all of them 
(new, changed, and unchanged), and they all become part of the new commit.
   
   Now when you want to query increments, the result will include a lot of 
unnecessary (unchanged) rows as well. I would like to avoid that. Is there a 
way to drop unchanged rows from the source?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Fully load the Hudi table
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        0       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |       2_2      |        0       |
   ---------------------------------------------------------------------
   ```
   2. Incrementally upsert a new data set (the incremental data set should 
include unchanged records)
   
   Incremental data:
   ```
   ----------------------------------------------------
   |     row_key    |     att_1      |      att_2     |  
   ----------------------------------------------------
   |        1       |      1_1       |       1_2      |
   ----------------------------------------------------
   |        2       |      2_1       |    changed     |
   ----------------------------------------------------
   |        3       |      3_1       |       3_2      |
   ----------------------------------------------------
   |        4       |      4_1       |       4_2      |
   ----------------------------------------------------
   ```
   3. Incrementally query the Hudi table for the latest commit
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        1       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |    changed     |        1       |
   ---------------------------------------------------------------------
   |        3       |      3_1       |       3_2      |        1       |
   ---------------------------------------------------------------------
   |        4       |      4_1       |       4_2      |        1       |
   ---------------------------------------------------------------------
   ```
   **Expected behavior**
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        0       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |    changed     |        1       |
   ---------------------------------------------------------------------
   |        3       |      3_1       |       3_2      |        1       |
   ---------------------------------------------------------------------
   |        4       |      4_1       |       4_2      |        1       |
   ---------------------------------------------------------------------
   ```
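
The expected result above amounts to dropping unchanged rows from the incremental data before the upsert. A minimal sketch of that idea in plain Python, using the example rows from this issue. It is only an illustration of the logic, not Hudi code: `drop_unchanged` and `upsert` are hypothetical helpers, and the integer `commit` field stands in for the commit column shown in the tables.

```python
from typing import Dict, List

Row = Dict[str, object]


def drop_unchanged(target: List[Row], incoming: List[Row], key: str = "row_key") -> List[Row]:
    """Keep only incoming rows that are new or differ from the target row with the same key."""
    # Compare on data columns only; the commit column is metadata, not source data.
    current = {
        row[key]: {k: v for k, v in row.items() if k != "commit"}
        for row in target
    }
    return [row for row in incoming if current.get(row[key]) != row]


def upsert(target: List[Row], changes: List[Row], commit: int, key: str = "row_key") -> List[Row]:
    """Toy upsert: every written row is stamped with the new commit, untouched rows keep theirs."""
    merged = {row[key]: dict(row) for row in target}
    for row in changes:
        merged[row[key]] = {**row, "commit": commit}
    return sorted(merged.values(), key=lambda r: r[key])


# Target table after the full load (commit 0).
target = [
    {"row_key": "1", "att_1": "1_1", "att_2": "1_2", "commit": 0},
    {"row_key": "2", "att_1": "2_1", "att_2": "2_2", "commit": 0},
]

# Incremental data set, including the unchanged row 1.
incoming = [
    {"row_key": "1", "att_1": "1_1", "att_2": "1_2"},
    {"row_key": "2", "att_1": "2_1", "att_2": "changed"},
    {"row_key": "3", "att_1": "3_1", "att_2": "3_2"},
    {"row_key": "4", "att_1": "4_1", "att_2": "4_2"},
]

changes = drop_unchanged(target, incoming)
result = upsert(target, changes, commit=1)
for row in result:
    print(row)
```

With this pre-filter, row 1 is never rewritten and keeps commit 0, so it would not show up in an incremental query for commit 1. In Spark, a similar pre-filter could presumably be expressed as a left anti join of the incoming DataFrame against a snapshot read of the table on the key and all data columns, at the cost of reading the current table state before each upsert.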
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   * Spark version : 2.4.5
   * Storage (HDFS/S3/GCS..) : S3
   
   Thank you in advance!
   
   



