nmukerje opened a new issue #3321:
URL: https://github.com/apache/hudi/issues/3321


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I am trying to set "_hoodie_is_deleted" to "true" and run an UPSERT 
operation on a Hudi table, but the record is not deleted. The eventual goal is 
to mix deletes and upserts in the same transaction, and for that we need 
"_hoodie_is_deleted" to be honored.
   
   As per the docs, "Deletion takes the same path as upsert and so it relies on a 
specific field called “_hoodie_is_deleted” of type boolean in each record." 
I believe from past discussions that this should work both with 
HoodieDeltaStreamer and with the Spark datasource.
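
For context, a minimal PySpark sketch of the write path described above, assuming a Hudi bundle is on the Spark classpath. The helper names (`hudi_upsert_options`, `with_delete_flag`, `upsert`) and their arguments are hypothetical and are not taken from the linked notebook; the option keys are the usual Hudi datasource write options.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi datasource options for an upsert write (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def with_delete_flag(df, deleted):
    """Return df with a boolean _hoodie_is_deleted column set to `deleted`."""
    # Imported lazily so the options helper can be used without Spark installed.
    from pyspark.sql.functions import lit
    return df.withColumn("_hoodie_is_deleted", lit(bool(deleted)))


def upsert(df, base_path, options):
    """Append-mode write through the Hudi Spark datasource."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

With these helpers, the delete attempt would look like `upsert(with_delete_flag(rows_to_delete, True), base_path, opts)`, where `rows_to_delete` and `base_path` are placeholders.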
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Insert a few records into a table.
   2. Set _hoodie_is_deleted to true and try to delete a few records. 
   3. This actually fails, as Hudi expects _hoodie_is_deleted to be non-null 
even for existing records.
   
   ```
   Caused by: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
        at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
        at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
        at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
        at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:94)
        at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:251)
        ... 8 more
   ```
   
   4. Re-insert the initial records with _hoodie_is_deleted set to "false". 
   5. Set _hoodie_is_deleted to true and try to delete a few records again.
   6. Hudi does not do an upsert, and both "_hoodie_is_deleted" values of true 
and false are visible in the table.
   
   ```
   +-------------------+------------------+-----------------+
   |_hoodie_commit_time|_hoodie_is_deleted|committed_records|
   +-------------------+------------------+-----------------+
   |20210721190717     |false             |19               |
   |20210721190743     |true              |1                |
   +-------------------+------------------+-----------------+
   ```
   
   **Expected behavior**
   
   The expected behavior is:
   1/ Setting "_hoodie_is_deleted" to "false" should not be required for all 
records.
   2/ A record upserted with the column "_hoodie_is_deleted" set to "true" 
should get deleted.
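
The two expectations above can be written down as a toy, pure-Python model. This is only a sketch of the semantics being asked for, not Hudi's actual merge implementation; the `apply_upsert` function, the `id` key, and the sample records are made up for illustration.

```python
def apply_upsert(table, batch, key="id", flag="_hoodie_is_deleted"):
    """Toy model of the expected merge; not Hudi's actual implementation.

    table: dict mapping record key -> record (dict)
    batch: iterable of incoming records (dicts)
    """
    merged = dict(table)
    for record in batch:
        if record.get(flag) is True:
            # Expectation 2: a flagged record removes the stored row.
            merged.pop(record[key], None)
        else:
            # Expectation 1: a missing or null flag means "not deleted".
            merged[record[key]] = record
    return merged


# Initial insert without the flag column at all.
table = apply_upsert({}, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])

# Mixed batch: one delete, one update with a null flag, one new record.
table = apply_upsert(table, [
    {"id": 1, "_hoodie_is_deleted": True},
    {"id": 2, "v": "b2", "_hoodie_is_deleted": None},
    {"id": 3, "v": "c"},
])
print(sorted(table))  # [2, 3]
```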
   
   **Environment Description**
   
   * Hudi version : 0.7.0 installed in EMR 6.3.0
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version :  3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Sample reproducible test is here - 
https://github.com/nmukerje/misc/blob/master/Hudi/Hudi_Pyspark_Delete_Upsert_Test.ipynb
   
   **Stacktrace**
   
   Already included above.
   

