afeldman1 opened a new issue #2399:
URL: https://github.com/apache/hudi/issues/2399


   Using a COW table, I tried deleting records with both supported methods:
   1) Adding a _hoodie_is_deleted column with .withColumn("_hoodie_is_deleted", lit(true)) and then writing as an upsert
   2) Writing the DataFrame of rows to be deleted with "hoodie.datasource.write.operation" set to DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL ("delete")
   
   EMR reports that the step completed successfully; however, the rows are not removed.
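
   For reference, the two attempts looked roughly like this (a sketch; `basePath`, the table name, and the key/partition/precombine fields are placeholders for my actual config):

   ```scala
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions.lit

   // toDelete holds the rows to remove (col_a, col_b, col_c)

   // Method 1: flag the rows with _hoodie_is_deleted and upsert them
   toDelete
     .withColumn("_hoodie_is_deleted", lit(true))
     .write.format("hudi")
     .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
     .option(RECORDKEY_FIELD_OPT_KEY, "col_b")           // placeholder: my key field
     .option(PARTITIONPATH_FIELD_OPT_KEY, "col_b,col_c") // placeholder: my partition fields
     .option(PRECOMBINE_FIELD_OPT_KEY, "col_a")          // placeholder
     .option("hoodie.table.name", "test_tbl_nm")
     .mode(SaveMode.Append)
     .save(basePath)

   // Method 2: write the same rows with the delete operation
   toDelete
     .write.format("hudi")
     .option(OPERATION_OPT_KEY, DELETE_OPERATION_OPT_VAL)
     .option(RECORDKEY_FIELD_OPT_KEY, "col_b")
     .option(PARTITIONPATH_FIELD_OPT_KEY, "col_b,col_c")
     .option(PRECOMBINE_FIELD_OPT_KEY, "col_a")
     .option("hoodie.table.name", "test_tbl_nm")
     .mode(SaveMode.Append)
     .save(basePath)
   ```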
   
   Existing hudi table state:

   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-----+-----+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|   col_a|col_b|col_c|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-----+-----+
   |     20201231034701|  20201231034701_2_3|          col_b:42|                 10/42|e716352e-cea3-441...|87679865|   10|   42|
   |     20201231034701|  20201231034701_3_4|          col_b:25|                  1/25|49768dca-3b24-459...|  123456|    1|   25|
   |     20201231034701|  20201231034701_1_2|          col_b:27|                  2/27|4d1724fb-c19b-4e3...|  188303|    2|   27|
   |     20201231034701|  20201231034701_0_1|          col_b:60|                  3/60|423b4820-0a28-4ac...|  199303|    3|   60|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-----+-----+
   
   records to be deleted (matches the first row in the above table):

   +--------+-----+-----+
   |   col_a|col_b|col_c|
   +--------+-----+-----+
   |87679865|   10|   42|
   +--------+-----+-----+
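
   The delete frame itself was built along these lines (a sketch, constructing the single matching row directly):

   ```scala
   import spark.implicits._

   // single row to delete, matching col_a = 87679865 in the table above
   val toDelete = Seq((87679865, 10, 42)).toDF("col_a", "col_b", "col_c")
   toDelete.show()
   ```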
   
   In the commit file under the .hoodie folder for the delete commit, we can see the commit was made, but no records were deleted:
   
   {
     "partitionToWriteStats" : { },
     "compacted" : false,
     "extraMetadata" : {
       "schema" : 
"{\"type\":\"record\",\"name\":\"test_tbl_nm_record\",\"namespace\":\"hoodie.test_tbl_nm\",\"fields\":[{\"name\":\"col_a\",\"type\":\"int\"},{\"name\":\"col_b\",\"type\":\"int\"},{\"name\":\"col_c\",\"type\":\"int\"}]}"
     },
     "operationType" : "DELETE",
     "totalScanTime" : 0,
     "totalCreateTime" : 0,
     "totalUpsertTime" : 0,
     "totalCompactedRecordsUpdated" : 0,
     "totalLogFilesCompacted" : 0,
     "totalLogFilesSize" : 0,
     "fileIdAndRelativePaths" : { },
     "totalRecordsDeleted" : 0,
     "totalLogRecordsCompacted" : 0
   }
   
   Is this expected, or is delete not functioning correctly here?
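
   Reading the table back after the delete commit confirms nothing changed (the glob depth here is an assumption matching the two-level partition layout, e.g. 10/42):

   ```scala
   // snapshot read: two partition levels plus the data files
   val after = spark.read.format("hudi").load(basePath + "/*/*/*")

   // the row that should have been deleted is still returned
   after.filter($"col_a" === 87679865).show()
   ```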
   
   
   **Environment Description**

   * Hudi version: 0.6.0
   * Spark version: 2.4.6
   * Hive version: 2.3.7
   * Hadoop version: 2.10.0
   * Storage: S3 with Glue metastore
   * Running on: EMR 5.31.0
   

