WTa-hash opened a new issue #2057: URL: https://github.com/apache/hudi/issues/2057
I am having an issue where Hudi does not process deletes correctly when an insert and a delete for the same record exist within the same batch. The original issue is reported here: https://issues.apache.org/jira/browse/HUDI-802, but it is marked as closed for the 0.6.0 release.

Below are the app versions:

AWS EMR: 5.30.1
Hudi version : 0.6.0
Spark version : 2.4.5
Hive version : 2.3.6
Hadoop version : 2.8.5
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

Attached is the script to reproduce: [Example.txt](https://github.com/apache/hudi/files/5150354/Example.txt)

Basically, I have a dataframe where ID=3 is marked as deleted:

+---+-------+-------------+-------+-------------------+----+
|id |name   |desc         |groupId|__timestamp        |Op  |
+---+-------+-------------+-------+-------------------+----+
|1  |Bob    |Manager II   |100    |1970-01-01 00:00:00|null|
|2  |John   |Associate I  |200    |1970-01-01 00:00:00|null|
|3  |Michael|null         |200    |1970-01-01 00:00:00|null|
|3  |Michael|null         |200    |2020-01-04 00:00:00|D   |
|4  |William|Manager I    |100    |1998-04-13 00:00:00|I   |
|5  |Fred   |Associate III|200    |2020-11-01 00:00:00|I   |
+---+-------+-------------+-------+-------------------+----+

However, in the resulting Hudi table, ID=3 gets inserted instead of ignored:

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------+-------------+-------+-------------------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |id |name   |desc         |groupId|__timestamp        |Op  |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------+-------------+-------+-------------------+----+
|20200831121552     |20200831121552_0_2  |id:1              |100                   |b5f359af-37b5-4238-8138-1b8ce82179fe-0_0-22-12017_20200831121552.parquet|1  |Bob    |Manager II   |100    |1970-01-01 00:00:00|null|
|20200831121552     |20200831121552_1_1  |id:2              |200                   |05c3d5c6-4696-41d3-a6d3-9b34d63eeb68-0_1-22-12018_20200831121552.parquet|2  |John   |Associate I  |200    |1970-01-01 00:00:00|null|
|20200831121552     |20200831121552_1_2  |id:3              |200                   |05c3d5c6-4696-41d3-a6d3-9b34d63eeb68-0_1-22-12018_20200831121552.parquet|3  |Michael|null         |200    |2020-01-04 00:00:00|D   |
|20200831121552     |20200831121552_0_1  |id:4              |100                   |b5f359af-37b5-4238-8138-1b8ce82179fe-0_0-22-12017_20200831121552.parquet|4  |William|Manager I    |100    |1998-04-13 00:00:00|I   |
|20200831121552     |20200831121552_1_3  |id:5              |200                   |05c3d5c6-4696-41d3-a6d3-9b34d63eeb68-0_1-22-12018_20200831121552.parquet|5  |Fred   |Associate III|200    |2020-11-01 00:00:00|I   |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------+-------------+-------+-------------------+----+

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
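For reference, the outcome the report expects for the sample batch can be sketched in plain Python. This is only an illustration of the intended semantics, not Hudi code and not the attached script. It assumes the latest row per `id` is chosen by `__timestamp` (the report does not say which field is used for ordering), and the helper name `expected_rows` is hypothetical.

```python
from datetime import datetime

# Sample batch copied from the report:
# (id, name, desc, groupId, __timestamp, Op)
batch = [
    (1, "Bob", "Manager II", 100, "1970-01-01 00:00:00", None),
    (2, "John", "Associate I", 200, "1970-01-01 00:00:00", None),
    (3, "Michael", None, 200, "1970-01-01 00:00:00", None),
    (3, "Michael", None, 200, "2020-01-04 00:00:00", "D"),
    (4, "William", "Manager I", 100, "1998-04-13 00:00:00", "I"),
    (5, "Fred", "Associate III", 200, "2020-11-01 00:00:00", "I"),
]


def expected_rows(rows):
    """Keep the latest row per id, then drop ids whose latest row is a delete.

    Assumption: ordering within a key is by __timestamp (hypothetical choice
    made for this sketch; the report does not name the ordering field).
    """
    latest = {}
    for row in rows:
        key = row[0]
        ts = datetime.strptime(row[4], "%Y-%m-%d %H:%M:%S")
        # A later (or equal) timestamp replaces the row seen so far for this id.
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, row)
    survivors = [row for _, row in latest.values() if row[5] != "D"]
    return sorted(survivors, key=lambda row: row[0])


if __name__ == "__main__":
    # ID=3 is absent because its latest row carries Op = "D".
    print([row[0] for row in expected_rows(batch)])
```

Under these assumptions, the ids that should remain after the batch is applied are 1, 2, 4 and 5, whereas the Hudi table shown in the report still contains `id:3`.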
