geoffroyatkwiff opened a new issue #4778:
URL: https://github.com/apache/hudi/issues/4778


   **Describe the problem you faced**
   
   I am using the `_hoodie_is_deleted` column, but some rows that should not be 
written to the target table are written anyway.
   When an "Insert" and the "Delete" for that same row are both in the same 
"source" parquet file, the "Delete" record is stored in the target table.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a table using one parquet file (containing only Inserts) as the 
source
   2. Generate a parquet file that stores the following incremental changes:
       - a Delete for one of the rows currently in the target table
       - an Insert of a new row
       - a Delete of this new row that was just inserted
   3. Set the `_hoodie_is_deleted` column accordingly and process/write the 
dataframe to the target table. Following the above, the three rows have these 
values in the column, respectively: True, False, True
   4. The row that was already in the target table and was deleted in this 
latest update is indeed deleted
   5. The other row, whose `Insert` and `Delete` operations were stored in the 
same source parquet file (so both are in the same dataframe that was just 
processed), is still present in the target table.
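
For illustration, the batch from steps 2 and 3 can be sketched in plain Python. This is only a minimal model of the input, not the actual Spark/Hudi job: the column names `id`, `ts`, and `op` and the helper `add_deleted_flag` are hypothetical and do not come from the report; only `_hoodie_is_deleted` does.

```python
# Hypothetical change records; "id", "ts" and "op" are illustrative column names.
changes = [
    {"id": 2, "ts": 10, "op": "D"},  # Delete of a row already in the target table
    {"id": 4, "ts": 11, "op": "I"},  # Insert of a new row
    {"id": 4, "ts": 12, "op": "D"},  # Delete of the row that was just inserted
]


def add_deleted_flag(records):
    """Return copies of the records with `_hoodie_is_deleted` derived from `op`."""
    return [{**r, "_hoodie_is_deleted": r["op"] == "D"} for r in records]


batch = add_deleted_flag(changes)
flags = [r["_hoodie_is_deleted"] for r in batch]
print(flags)  # [True, False, True]
```

Note that the Insert and the Delete for `id` 4 share the same key within a single batch, which is the situation described in step 5.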
   
   **Expected behavior**
   
   The row whose `Insert` and `Delete` operations were stored in the same 
source parquet file shouldn't be written to the target table.
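
The expected outcome can be modelled with a small, self-contained Python sketch. It is not Hudi's implementation; it only encodes the expectation stated above, under the assumption that records with the same key are ordered by a hypothetical `ts` field so that the Delete counts as the latest change. The function `expected_target` and the fields `id` and `ts` are illustrative names.

```python
def expected_target(existing, batch, key="id", order="ts"):
    """Model the expected target keys after applying one batch.

    Assumption: within the batch, the record with the highest `order`
    value per key is the one that decides the final state of that key.
    """
    # Collapse the batch to the latest record per key.
    latest = {}
    for rec in batch:
        k = rec[key]
        if k not in latest or rec[order] >= latest[k][order]:
            latest[k] = rec

    # Apply the collapsed batch: flagged records delete, others upsert.
    target = {row[key]: row for row in existing}
    for k, rec in latest.items():
        if rec["_hoodie_is_deleted"]:
            target.pop(k, None)
        else:
            target[k] = rec
    return sorted(target)


existing = [{"id": 1, "ts": 1}, {"id": 2, "ts": 1}, {"id": 3, "ts": 1}]
batch = [
    {"id": 2, "ts": 10, "_hoodie_is_deleted": True},   # delete existing row
    {"id": 4, "ts": 11, "_hoodie_is_deleted": False},  # insert new row
    {"id": 4, "ts": 12, "_hoodie_is_deleted": True},   # delete that new row
]
remaining = expected_target(existing, batch)
print(remaining)  # [1, 3]
```

Under this model, neither `id` 2 nor `id` 4 is in the target table after the write, whereas the reported behavior leaves the `id` 4 row present.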
   
   
   



