[I] [SUPPORT] Setting hoodie.datasource.insert.dup.policy to drop still upserts the record in 0.14 [hudi]

via GitHub Sun, 11 Feb 2024 08:17:14 -0800


keerthiskating opened a new issue, #10650:
URL: https://github.com/apache/hudi/issues/10650


   **Describe the problem you faced**
   
   If the incoming dataset contains a record that already exists in the Hudi table, Hudi still updates the commit time and treats the record as an update, even with 'hoodie.datasource.insert.dup.policy': 'drop' set.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   ```
   # table_name and path are assumed to be defined earlier (not shown in this report)
   recordkey = "id,name"
   precombine = "uuid"
   method = "upsert"
   table_type = "COPY_ON_WRITE"
   
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': recordkey,
       'hoodie.datasource.insert.dup.policy': 'drop',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': method,
       'hoodie.datasource.write.precombine.field': precombine,
       'hoodie.table.cdc.enabled':'true',
       'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after',
   }
   
   spark_df = spark.createDataFrame(
       data=[
       (1, "John",  1, False),
       (2, "Doe",  2, False),
   ], 
   schema=["id", "name", "val", "_hoodie_is_deleted"])
   
   from pyspark.sql.functions import sha2, concat_ws
   
   record_key_col_array = recordkey.split(",")
   # Derive the uuid column by hashing the record key columns
   spark_df = spark_df.withColumn("uuid", sha2(concat_ws("||", *record_key_col_array), 256))
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("overwrite"). \
       save(path)
   
   df = spark. \
         read. \
         format("hudi"). \
         load(path)
   
   df.select(['_hoodie_commit_time', 'id', 'name', 'val']).show()
   
   +-------------------+---+----+---+
   |_hoodie_commit_time| id|name|val|
   +-------------------+---+----+---+
   |  20240211155820562|  1|John|  1|
   |  20240211155820562|  2| Doe|  2|
   +-------------------+---+----+---+
   
   
   spark_df = spark.createDataFrame(
       data=[
       (1, "John",  1, False)
   ], 
       schema=["id", "name", "val", "_hoodie_is_deleted"])
   spark_df = spark_df.withColumn("uuid", sha2(concat_ws("||", *record_key_col_array), 256))
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("append"). \
       save(path)
   
   # read latest data
   
   df = spark. \
         read. \
         format("hudi"). \
         load(path)
   
   df.select(['_hoodie_commit_time', 'id', 'name', 'val']).show()
   
   +-------------------+---+----+---+
   |_hoodie_commit_time| id|name|val|
   +-------------------+---+----+---+
   |  20240211155914976|  1|John|  1| ---> Commit time was updated even though the record did not change.
   |  20240211155820562|  2| Doe|  2|
   +-------------------+---+----+---+
   
   # query cdc data
   cdc_read_options = {
       'hoodie.datasource.query.incremental.format': 'cdc',
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': latest_commit_ts,  # begin instant; its value is not shown in this report
       # 'hoodie.datasource.read.end.instanttime': 20240208210952160,
   }
   df=spark.read.format("hudi"). \
       options(**cdc_read_options). \
       load(path)
   
   df.show(2,False)
   
   +---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
   |op |ts_ms            |before                                                                                                                                      |after                                                                                                                                       |
   +---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
   |u  |20240211155914976|{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false, "uuid": "46ca69f145f50f414b7a8cd59656f4935a5162798f093edc708a1ba21c0e9c26"}|{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false, "uuid": "46ca69f145f50f414b7a8cd59656f4935a5162798f093edc708a1ba21c0e9c26"}|
   +---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
   
   ```
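
One thing that may be worth checking: as far as the Hudi configuration reference describes it, `hoodie.datasource.insert.dup.policy` is consulted when the write operation is `insert`, whereas the reproduction above uses `upsert`. The sketch below only shows how the same options could be switched to `insert` for a comparison run. It is an untested assumption that this changes the outcome, `with_insert_operation` is a hypothetical helper, and the table name is a placeholder.

```python
def with_insert_operation(options: dict) -> dict:
    """Return a copy of the write options with the operation switched to 'insert'."""
    updated = dict(options)  # shallow copy so the original options stay untouched
    updated["hoodie.datasource.write.operation"] = "insert"
    return updated


# Option keys are taken from the report; "example_table" is a placeholder value.
base_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "id,name",
    "hoodie.datasource.insert.dup.policy": "drop",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "uuid",
}

insert_options = with_insert_operation(base_options)
```

The resulting dictionary would be passed to `options(**insert_options)` in place of `hudi_options` for the second (append) write.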
   **Expected behavior**
   
   Since no records were changed, Hudi should not report any updates when the CDC query is run.
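
To make the expectation concrete: the CDC row in the output above has op `u`, yet its before and after images are identical. A minimal sketch of a check for such rows, working in plain Python on the JSON strings as they appear in the CDC output; `is_noop_update` is a hypothetical helper name, not a Hudi API.

```python
import json
from typing import Optional


def is_noop_update(op: str, before: Optional[str], after: Optional[str]) -> bool:
    """Return True for a CDC 'u' row whose before and after images are identical."""
    if op != "u" or before is None or after is None:
        return False
    # Compare parsed JSON so key order and whitespace differences do not matter.
    return json.loads(before) == json.loads(after)


# The before/after image copied from the CDC output in this report.
image = (
    '{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false, '
    '"uuid": "46ca69f145f50f414b7a8cd59656f4935a5162798f093edc708a1ba21c0e9c26"}'
)

noop = is_noop_update("u", image, image)
```

For the row reported here, `noop` is `True`, which is the case the report expects the CDC query not to emit.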
   
   **Environment Description**
   
   * Hudi version : 0.14
   
   * Spark version : 3.3.0-amzn-1
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no


