geoffroyatkwiff opened a new issue #4778:
URL: https://github.com/apache/hudi/issues/4778
**Describe the problem you faced**
I am using the `_hoodie_is_deleted` column, but some rows that should not be
written to the target table are written anyway.
When both the "Insert" and the "Delete" for a row (the delete being for said
inserted row) are in the same "source" parquet file, the row is still stored in
the target table.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a table using one parquet file (containing only Inserts) as the
source
2. Generate a parquet file that stores the following incremental changes:
- a Delete for one of the rows currently in the target table
- an Insert of a new row
- a Delete of this new row that was just inserted
3. Set the `_hoodie_is_deleted` column accordingly and process/write the
dataframe to the target table. Following the above, the rows have these values
in this column, respectively: True, False, True
4. The row that was already in the target table and was deleted in the latest
update is indeed deleted
5. The other row, whose `Insert` and `Delete` operations were stored in the
same source parquet file (so the Insert and the Delete are in the same
dataframe that was just processed), is still present in the target table.
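For illustration, the batch from steps 2 and 3 can be sketched in plain Python. The column names `id`, `op`, and `seq` are hypothetical (the report does not name the key or ordering columns), and `flag_deletes` is only a stand-in for however the dataframe column is actually populated:

```python
# Hypothetical change records; `id`, `op`, and `seq` are illustrative names,
# not taken from the report.
changes = [
    {"id": 1, "op": "D", "seq": 1},  # Delete of a row already in the target table
    {"id": 2, "op": "I", "seq": 2},  # Insert of a new row
    {"id": 2, "op": "D", "seq": 3},  # Delete of the row that was just inserted
]


def flag_deletes(records):
    """Return copies of the records with `_hoodie_is_deleted` derived from `op`."""
    return [{**r, "_hoodie_is_deleted": r["op"] == "D"} for r in records]


batch = flag_deletes(changes)
print([r["_hoodie_is_deleted"] for r in batch])  # [True, False, True]
```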
**Expected behavior**
The row whose `Insert` and `Delete` operations were stored in the same
source parquet file shouldn't be written to the target table.
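As a rough model of this expectation (not Hudi's actual merge logic), the outcome can be sketched as "per key, the latest change in the batch wins, and a winning delete removes the row". The sketch assumes the batch carries some ordering column; the names `id`, `seq`, and `apply_batch` are hypothetical:

```python
def apply_batch(target, batch, key="id", order="seq"):
    """Model of the expected result: the latest change per key decides the outcome."""
    latest = {}
    for rec in batch:
        k = rec[key]
        # Keep the record with the highest ordering value for each key.
        if k not in latest or rec[order] >= latest[k][order]:
            latest[k] = rec

    result = dict(target)
    for k, rec in latest.items():
        if rec["_hoodie_is_deleted"]:
            result.pop(k, None)  # a winning delete removes (or never adds) the row
        else:
            result[k] = rec
    return result


# Target table before the update: rows 0 and 1 exist.
target = {
    0: {"id": 0, "seq": 0, "_hoodie_is_deleted": False},
    1: {"id": 1, "seq": 0, "_hoodie_is_deleted": False},
}
# Incremental batch: delete row 1, insert row 2, delete row 2.
batch = [
    {"id": 1, "seq": 1, "_hoodie_is_deleted": True},
    {"id": 2, "seq": 2, "_hoodie_is_deleted": False},
    {"id": 2, "seq": 3, "_hoodie_is_deleted": True},
]

print(sorted(apply_batch(target, batch)))  # [0]
```

Under this model row 2 never reaches the target table, which is the behavior the report expects; the observed behavior leaves row 2 present.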