keerthiskating opened a new issue, #10650:
URL: https://github.com/apache/hudi/issues/10650
**Describe the problem you faced**
If my incoming dataset contains a record that already exists in the Hudi
table, Hudi still updates the commit time and treats the record as an update,
even after setting 'hoodie.datasource.insert.dup.policy': 'drop'.
**To Reproduce**
Steps to reproduce the behavior:
```
recordkey = "id,name"
precombine = "uuid"
method = "upsert"
table_type = "COPY_ON_WRITE"
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': recordkey,
    'hoodie.datasource.insert.dup.policy': 'drop',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
    'hoodie.datasource.write.precombine.field': precombine,
    'hoodie.table.cdc.enabled': 'true',
    'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after',
}
spark_df = spark.createDataFrame(
    data=[
        (1, "John", 1, False),
        (2, "Doe", 2, False),
    ],
    schema=["id", "name", "val", "_hoodie_is_deleted"])
from pyspark.sql.functions import sha2, concat_ws
record_key_col_array = recordkey.split(",")
record_key_col_array
spark_df = spark_df.withColumn("uuid", sha2(concat_ws("||", *record_key_col_array), 256))
spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(path)
df = spark. \
    read. \
    format("hudi"). \
    load(path)
df.select(['_hoodie_commit_time', 'id', 'name', 'val']).show()
+-------------------+---+----+---+
|_hoodie_commit_time| id|name|val|
+-------------------+---+----+---+
|  20240211155820562|  1|John|  1|
|  20240211155820562|  2| Doe|  2|
+-------------------+---+----+---+
spark_df = spark.createDataFrame(
    data=[
        (1, "John", 1, False)
    ],
    schema=["id", "name", "val", "_hoodie_is_deleted"])
spark_df = spark_df.withColumn("uuid", sha2(concat_ws("||", *record_key_col_array), 256))
spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)
# read latest data
df = spark. \
    read. \
    format("hudi"). \
    load(path)
df.select(['_hoodie_commit_time', 'id', 'name', 'val']).show()
+-------------------+---+----+---+
|_hoodie_commit_time| id|name|val|
+-------------------+---+----+---+
|  20240211155914976|  1|John|  1| ---> Commit time was updated even though the record did not change.
|  20240211155820562|  2| Doe|  2|
+-------------------+---+----+---+
# query cdc data
cdc_read_options = {
    'hoodie.datasource.query.incremental.format': 'cdc',
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': latest_commmit_ts
    # 'hoodie.datasource.read.end.instanttime': 20240208210952160,
}
df = spark.read.format("hudi"). \
    options(**cdc_read_options). \
    load(path)
df.show(2,False)
+---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
|op |ts_ms |before
|after
|
+---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
|u  |20240211155914976|{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false, "uuid": "46ca69f145f50f414b7a8cd59656f4935a5162798f093edc708a1ba21c0e9c26"}|{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false, "uuid": "46ca69f145f50f414b7a8cd59656f4935a5162798f093edc708a1ba21c0e9c26"}|
+---+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
```
**Expected behavior**
Since no records were changed, Hudi should not report any updates when a
CDC query is run.
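
To make the expectation concrete, here is a minimal sketch of a client-side filter that discards CDC rows whose `before` and `after` images are identical, as in the output above. It assumes the CDC rows have already been collected into plain Python dicts with the `op`, `before`, and `after` columns shown in the query result. The helper name `drop_noop_updates` is hypothetical and is not part of Hudi. This only hides the no-op update on the read side; it does not stop the commit time from being rewritten.

```python
import json


def drop_noop_updates(cdc_rows):
    """Return the CDC rows, skipping updates whose before and after images match.

    Hypothetical helper, not a Hudi API. Each row is assumed to be a dict with
    'op', 'before', and 'after' keys, where the images are JSON strings
    (or None, e.g. for inserts and deletes).
    """
    kept = []
    for row in cdc_rows:
        is_update = row["op"] == "u"
        has_both_images = row["before"] is not None and row["after"] is not None
        if is_update and has_both_images:
            # Compare parsed JSON so key order and whitespace do not matter.
            if json.loads(row["before"]) == json.loads(row["after"]):
                continue
        kept.append(row)
    return kept


# Sample rows modelled on the CDC output in this issue (uuid omitted for brevity).
image = '{"id": 1, "name": "John", "val": 1, "_hoodie_is_deleted": false}'
changed = '{"id": 1, "name": "John", "val": 2, "_hoodie_is_deleted": false}'
rows = [
    {"op": "u", "ts_ms": "20240211155914976", "before": image, "after": image},
    {"op": "u", "ts_ms": "20240211155914976", "before": image, "after": changed},
]
filtered = drop_noop_updates(rows)
print(len(filtered))  # prints 1: only the row whose value actually changed is kept
```

For a small result set, the input dicts could be built with `[r.asDict() for r in df.collect()]`; for larger results the same comparison would need to run inside Spark instead.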
**Environment Description**
* Hudi version : 0.14
* Spark version : 3.3.0-amzn-1
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : no