venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482
 
 
   I am trying to get "Deletion with HoodieDeltaStreamer" working for my 
existing Hudi dataset, following 
https://cwiki.apache.org/confluence/display/HUDI/2020/01/15/Delete+support+in+Hudi
   My initial dataset was written without the "_hoodie_is_deleted" key; I am 
now trying to upsert records with this key set on all incoming records. My code:
   <code>
   // Split the incoming batch on the delete sentinel and tag each side explicitly.
   Dataset<Row> deletedRows =
       dataframe.filter(dataframe.col(this.deleteKey).equalTo(this.deleteValue));
   Dataset<Row> remainingRows =
       dataframe.filter(dataframe.col(this.deleteKey).notEqual(this.deleteValue));
   deletedRows = deletedRows.withColumn("_hoodie_is_deleted", lit(true));
   remainingRows = remainingRows.withColumn("_hoodie_is_deleted", lit(false));
   dataframe = deletedRows.union(remainingRows);
   </code>
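   For clarity, the tagging rule above guarantees that every *incoming* record 
carries an explicit flag; a minimal plain-Python sketch of the same rule (the 
field and parameter names here are just illustrative, not Hudi API):

   ```python
   def tag_deletes(records, delete_key, delete_value):
       """Give every incoming record an explicit _hoodie_is_deleted flag:
       True when the delete key matches the sentinel value, False otherwise."""
       return [
           {**r, "_hoodie_is_deleted": r.get(delete_key) == delete_value}
           for r in records
       ]

   rows = [{"id": 1, "op": "D"}, {"id": 2, "op": "U"}]
   tagged = tag_deletes(rows, "op", "D")
   # Every incoming record now carries the flag; none is left null.
   ```

   Since the incoming batch always has the flag set, the null value in the 
error presumably comes from the existing records being rewritten during the merge.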
   I have noticed that the upsert runs fine when the record to be deleted is 
the only record in the parquet file, but when the file contains other records 
it fails with the error below. Would appreciate any help here.
   
   Null-value for required field: _hoodie_is_deleted
        at 
org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Load the initial dataset without _hoodie_is_deleted in the schema.
   2. Pick a record from a parquet file that contains multiple records.
   3. Delete this record by setting _hoodie_is_deleted : true, and pass this 
flag on all incoming upserts.
   4. The upsert throws "Null-value for required field: _hoodie_is_deleted".
   
   Deletion works when the record to be deleted is the only record in the 
parquet file.
   
   **Expected behavior**
   
   Only the single flagged record should be deleted from the parquet file; all 
other records should remain, and the upsert should not throw "Null-value for 
required field: _hoodie_is_deleted".
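   
   The stack trace suggests the writer schema treats _hoodie_is_deleted as a 
required boolean, so existing records being rewritten during the merge (which 
predate the field) carry a null and fail. A possible workaround (my assumption, 
not confirmed by the Hudi docs) would be to declare the field in the source 
Avro schema as a nullable boolean with a default of false, e.g.:
   
   ```json
   {
     "name": "_hoodie_is_deleted",
     "type": ["boolean", "null"],
     "default": false
   }
   ```
   
   (Note the Avro rule that a union's default must match its first branch, 
hence "boolean" listed before "null".)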
   
   **Environment Description**
   
   * Hudi version : 0.5.1
   
   * Spark version : 2.2
   
   * EMR Version: emr-5.28
   
   * Hive version : NA
   
   * Hadoop version : NA
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   StackTrace : 
   
   Caused by: org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value 
for required field: _hoodie_is_deleted
        at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
        at 
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:204)
        ... 32 more
   Caused by: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
        ... 33 more
   Caused by: java.lang.RuntimeException: Null-value for required field: 
_hoodie_is_deleted
        at 
org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
        at 
org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
        at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
        at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
        at 
org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
        at 
org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:296)
        at 
org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:434)
        at 
org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:424)
        at 
org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
        at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services