uvplearn opened a new issue, #5869:
URL: https://github.com/apache/hudi/issues/5869
**Description**
With GLOBAL_BLOOM, a Hudi MOR table ends up with duplicate values for a record key across different partitions, and values in the same partition are not updated.
**Steps To Reproduce this behavior**
**STEP 1**
I created a Hudi table with the following input data and properties.
```python
hudi_options = {
    'hoodie.table.name': 'my_hudi_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.bloom.index.update.partition.path': 'true',
    'hoodie.index.type': 'GLOBAL_BLOOM',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'pfg_silver_fantasy',
    'hoodie.datasource.hive_sync.table': 'hudi_test1',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
```
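With `hoodie.index.type=GLOBAL_BLOOM` and `hoodie.bloom.index.update.partition.path=true`, an upsert is indexed by record key alone: if an incoming record carries a new partition value, the old copy should be deleted and the record inserted into the new partition, so the key stays globally unique. A minimal pure-Python sketch of the semantics these options request (not Hudi's actual implementation, just a model of the expected behavior):

```python
# Hypothetical model of a global record-key index with
# update.partition.path=true: a key match replaces the stored row
# (moving it if the partition changed), never duplicating it.
def global_upsert(table, incoming, precombine='last_update_time'):
    """table/incoming: dicts mapping record key -> row dict.

    A row's 'creation_date' is its partition. On a key match, the row
    with the larger precombine value wins.
    """
    merged = dict(table)
    for key, row in incoming.items():
        existing = merged.get(key)
        if existing is None or row[precombine] >= existing[precombine]:
            merged[key] = row  # replaces the old copy regardless of partition
    return merged

# Rows from Step 1 and the Step 3 upsert below.
table = {
    '100': {'id': '100', 'creation_date': '2015-01-01', 'last_update_time': '1', 'new_col': 'a'},
    '101': {'id': '101', 'creation_date': '2015-01-01', 'last_update_time': '1', 'new_col': 'a'},
}
incoming = {
    '100': {'id': '100', 'creation_date': '2015-01-02', 'last_update_time': '2', 'new_col': 'b'},
    '101': {'id': '101', 'creation_date': '2015-01-01', 'last_update_time': '2', 'new_col': 'b'},
}
result = global_upsert(table, incoming)
# Key 100 moves to partition 2015-01-02; key 101 updates in place.
```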
```python
# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "1", "a"),
        ("101", "2015-01-01", "1", "a"),
    ],
    ["id", "creation_date", "last_update_time", "new_col"]
)

# Write the DataFrame as a Hudi dataset
inputDF.write \
    .format('org.apache.hudi') \
    .options(**hudi_options) \
    .mode('overwrite') \
    .save('s3://<loc>/hudi_test1')
```
**Output after Step 1 in the `_rt` table:**

| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | last_update_time | new_col | creation_date |
|---|---|---|---|---|---|---|---|---|
| 20220615024525 | 20220615024525_0_1 | id:101 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 101 | 1 | a | 2015-01-01 |
| 20220615024525 | 20220615024525_0_2 | id:100 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 100 | 1 | a | 2015-01-01 |
**Step 3: Upserting**
```python
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-02", "2", "b"),
        ("101", "2015-01-01", "2", "b")
    ],
    ["id", "creation_date", "last_update_time", "new_col"]
)

inputDF.write \
    .format('org.apache.hudi') \
    .options(**hudi_options) \
    .mode('append') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .save('s3://<loc>/hudi_test2')
```
**Output after Step 3 in the `_rt` table:**

| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | last_update_time | new_col | creation_date |
|---|---|---|---|---|---|---|---|---|
| 20220615024525 | 20220615024525_0_1 | id:101 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 101 | 1 | a | 2015-01-01 |
| 20220615024525 | 20220615024525_0_2 | id:100 | creation_date=2015-01-01 | cb8df2b4-1268-48b3-8665-1e4ac1196734-0_0-58-25650_20220615024525.parquet | 100 | 1 | a | 2015-01-01 |
| 20220615024626 | 20220615024626_1_3 | id:100 | creation_date=2015-01-02 | 6c1dbd2d-5db5-4c65-b180-f1d9561cf637-0_1-92-39217_20220615024626.parquet | 100 | 2 | b | 2015-01-02 |
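A quick duplicate check over the `(_hoodie_record_key, _hoodie_partition_path)` pairs observed above makes the problem concrete: with a global index every record key should appear exactly once, but `id:100` now exists in two partitions. A small sketch of that check on the observed rows:

```python
from collections import Counter

# The (_hoodie_record_key, _hoodie_partition_path) pairs from the table above.
observed = [
    ('id:101', 'creation_date=2015-01-01'),
    ('id:100', 'creation_date=2015-01-01'),
    ('id:100', 'creation_date=2015-01-02'),
]

# Count occurrences per record key; any count > 1 is a duplicate.
key_counts = Counter(key for key, _ in observed)
duplicates = {k: c for k, c in key_counts.items() if c > 1}
print(duplicates)  # {'id:100': 2}
```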
**Expected behavior**
There should be no duplicate record keys across partitions, and the record that stayed in the same partition should be updated with the new values.
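For the same-partition case (`id` 101), precombine semantics (assuming the default latest-record-wins payload, `OverwriteWithLatestAvroPayload`) should pick the incoming row because its `last_update_time` of 2 is newer than the stored 1. A minimal sketch of that comparison:

```python
# Hypothetical sketch of latest-record-wins precombine behavior
# (assumed default payload; field names taken from the report).
stored   = {'id': '101', 'last_update_time': '1', 'new_col': 'a'}
incoming = {'id': '101', 'last_update_time': '2', 'new_col': 'b'}

# The row with the larger precombine value should survive the merge.
winner = incoming if incoming['last_update_time'] >= stored['last_update_time'] else stored
print(winner['new_col'])  # 'b' -- the upsert should have taken effect
```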
**Environment Description**
* Hudi version : hudi-spark-bundle_2.11-0.7.0-amzn-1.jar
* Spark version : version 2.4.7-amzn-1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no