venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through DeltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482

I am trying to get "Deletion with HoodieDeltaStreamer" working for my existing dataset in Hudi, following https://cwiki.apache.org/confluence/display/HUDI/2020/01/15/Delete+support+in+Hudi

My initial dataset was written without the "_hoodie_is_deleted" field. I am now trying to upsert records with this field set on all incoming records. My code:

<code>
// Split the incoming batch into rows to delete and rows to keep,
// then tag each side with the _hoodie_is_deleted flag.
Dataset<Row> deletedRows = dataframe.filter(dataframe.col(this.deleteKey).equalTo(this.deleteValue));
Dataset<Row> remainingRows = dataframe.filter(dataframe.col(this.deleteKey).notEqual(this.deleteValue));
deletedRows = deletedRows.withColumn("_hoodie_is_deleted", lit(true));
remainingRows = remainingRows.withColumn("_hoodie_is_deleted", lit(false));
dataframe = deletedRows.union(remainingRows);
</code>

I have noticed that the upsert runs fine when the record to be deleted is the only record in the parquet file, but when the file contains other records it fails with:

<code>
Null-value for required field: _hoodie_is_deleted
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
</code>

Would appreciate any help here.

**To Reproduce**

Steps to reproduce the behavior:
1. Load the initial dataset without _hoodie_is_deleted in the schema.
2. Pick a record from a parquet file that contains multiple records.
3. Delete this record by setting _hoodie_is_deleted : true, and set this flag on all incoming upserts.
4.
Throws "Null-value for required field: _hoodie_is_deleted". It works when the record to be deleted is the only record in the parquet file.

**Expected behavior**

Only the single flagged record should be deleted from the parquet file; all other records should remain, and the upsert should not throw "Null-value for required field: _hoodie_is_deleted".

**Environment Description**

* Hudi version : 0.5.1
* Spark version : 2.2
* EMR version : emr-5.28
* Hive version : NA
* Hadoop version : NA
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

StackTrace:

<code>
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
    at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:204)
    ... 32 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
    ... 33 more
Caused by: java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
    at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
    at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
    at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:296)
    at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:434)
    at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:424)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
</code>
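My reading of why this only fails when the file contains other records (a sketch of my understanding, not confirmed by the Hudi docs): a copy-on-write upsert rewrites *every* record in the touched parquet file under the new writer schema via HoodieMergeHandle, not just the incoming ones. Records written before the schema change carry no value for _hoodie_is_deleted, and since the field is required (non-nullable) in the writer schema, the Parquet/Avro writer rejects them. A minimal plain-Java illustration of that behaviour (no Hudi or Parquet dependencies; `RequiredFieldSketch` and `writeRecord` are hypothetical names invented for this sketch):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequiredFieldSketch {

    // Mimics the check AvroWriteSupport effectively performs for a
    // required (non-nullable) field: a record with no value for it fails.
    public static void writeRecord(Map<String, Object> record, List<String> requiredFields) {
        for (String field : requiredFields) {
            if (record.get(field) == null) {
                throw new RuntimeException("Null-value for required field: " + field);
            }
        }
        // a real writer would serialize the record here
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("key", "_hoodie_is_deleted");

        // Incoming record: carries the new flag, writes fine.
        Map<String, Object> incoming = new HashMap<>();
        incoming.put("key", "B");
        incoming.put("_hoodie_is_deleted", true);
        writeRecord(incoming, required);

        // Untouched record from the same file, written before the schema
        // change: the merge rewrites it under the new schema anyway,
        // but it has no _hoodie_is_deleted value.
        Map<String, Object> legacy = new HashMap<>();
        legacy.put("key", "A");
        try {
            writeRecord(legacy, required);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
            // prints "Null-value for required field: _hoodie_is_deleted"
        }
    }
}
```

This matches the observed symptom: a file whose only record is the flagged one has no legacy records to rewrite, so the write succeeds.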
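If that reading is right, a possible workaround (my assumption, not something the wiki page states for pre-existing datasets) would be to make _hoodie_is_deleted nullable with a default in the Avro schema passed to the writer, so records written before the schema change can still be rewritten. A sketch of such a field definition, following Avro's rule that the default must match the first branch of the union:

```json
{
  "name": "_hoodie_is_deleted",
  "type": ["boolean", "null"],
  "default": false
}
```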
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
