Phil Chen created HUDI-2467:
-------------------------------

             Summary: Delete data is not working with 0.9.0 
                 Key: HUDI-2467
                 URL: https://issues.apache.org/jira/browse/HUDI-2467
             Project: Apache Hudi
          Issue Type: Bug
          Components: Spark Integration
            Reporter: Phil Chen


Following this spark guide:
[https://hudi.apache.org/docs/quick-start-guide/]

Everything works until delete data:

I'm using Pyspark with Spark 3.1.2 with python 3.9
{code:java}
// code placeholder
# pyspark# fetch total records count

spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()

# fetch two records to be deleted

ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
# issue deletes
hudi_delete_options = {  'hoodie.table.name': tableName,  
'hoodie.datasource.write.recordkey.field': 'uuid',  
'hoodie.datasource.write.partitionpath.field': 'partitionpath',  
'hoodie.datasource.write.table.name': tableName,  
'hoodie.datasource.write.operation': 'delete',  
'hoodie.datasource.write.precombine.field': 'ts',  
'hoodie.upsert.shuffle.parallelism': 2,   
'hoodie.insert.shuffle.parallelism': 2}

from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 
'partitionpath']).withColumn('ts', lit(0.0))

df.write.format("hudi"). \  
  options(**hudi_delete_options). \
  mode("append"). \
  save(basePath)

# run the same read query as above.
roAfterDeleteViewDF = spark. \
  read. \
  format("hudi"). \
  load(basePath) 

roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records

spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count(){code}

The count before delete is 10 and after delete is still 10 (expecting 8)


{code:java}
// code placeholder
>>> df.show()
+--------------------+--------------------+---+
|       partitionpath|                uuid| ts|
+--------------------+--------------------+---+
|74bed794-c854-4aa...|americas/united_s...|0.0|
|ce71c2dc-dedf-483...|americas/united_s...|0.0|
+--------------------+--------------------+---+
{code}
 

The 2 records to be deleted


Note, the 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to