mandar-mw opened a new issue, #5442:
URL: https://github.com/apache/hudi/issues/5442
HUDI does not seem to deduplicate records in some cases. Below is the
configuration that we use. We partition the data by customer_id and our
record key is [user_id, customer_id], so our expectation is that HUDI will
enforce uniqueness within each partition, i.e. each customer_id folder. However,
we are noticing that there are two parquet files inside some customer_id
folders, and when we query the data in these partitions, we see duplicate
user_id values within the same customer_id. The _hoodie_record_key is identical
for the two duplicate records, but the _hoodie_file_name is different, which
makes me suspect that Hudi is enforcing uniqueness not within the customer_id
folder but within the individual parquet files. Can someone explain this
behavior?
```
op: "INSERT"
target-base-path: "s3_path"
target-table: "some_table_name"
source-ordering-field: "created_at"
transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
filter-dupes: ""
hoodie_conf:
  # source table base path
  hoodie.deltastreamer.source.dfs.root: "s3_path"
  # record key, partition path and key generator
  hoodie.datasource.write.recordkey.field: "user_id,customer_id"
  hoodie.datasource.write.partitionpath.field: "customer_id"
  hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"
  # hive sync properties
  hoodie.datasource.hive_sync.enable: true
  hoodie.datasource.hive_sync.table: "table_name"
  hoodie.datasource.hive_sync.database: "database_name"
  hoodie.datasource.hive_sync.partition_fields: "customer_id"
  hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  hoodie.datasource.write.hive_style_partitioning: true
  # sql transformer
  hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at as created_at FROM <SRC> a"
  # since there is no dt partition, the following default config has to be overridden
  hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
```
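If the goal is table-wide uniqueness per record key rather than per-batch filtering, one option worth comparing against the setup above is running the DeltaStreamer with an upsert operation, so a later record for an existing key replaces the earlier one instead of landing in a new file. This is only an illustrative sketch, not a confirmed fix for this issue; the choice of `UPSERT` and of `created_at` as the precombine field are assumptions:

```
op: "UPSERT"
source-ordering-field: "created_at"
hoodie_conf:
  hoodie.datasource.write.recordkey.field: "user_id,customer_id"
  hoodie.datasource.write.partitionpath.field: "customer_id"
  # picks the winning record when two incoming versions share a key
  # (assumption: created_at orders versions correctly)
  hoodie.datasource.write.precombine.field: "created_at"
```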
Here is an example of duplicate records:
```
_hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | user_record_id | created_at | org
-- | -- | -- | -- | -- | -- | -- | --
20220316201026 | 20220316201026_95_35511 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 4a17e6ec-8f53-4a68-8878-6c8d6c4e2583-0_95-26-3087_20220316201026.parquet | <redacted> | 2020-03-24 05:03:53.016406+00 | <redacted>
20220315225025 | 20220315225025_81_28979 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 52631482-5c9b-4f84-97c1-1e5ab232b1de-0_81-26-8091_20220315225025.parquet | <redacted> | 2022-03-15 15:32:29.325168 | <redacted>
```
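The pattern above can be checked mechanically: rows that share a `_hoodie_record_key` but carry different `_hoodie_file_name` values are exactly the cross-file duplicates being reported. A minimal self-contained sketch (plain Python over exported rows; the sample key and file-name strings are made-up placeholders):

```python
from collections import defaultdict

def cross_file_duplicates(rows):
    """Return record keys that appear in more than one parquet file.

    rows: iterable of (_hoodie_record_key, _hoodie_file_name) pairs,
    e.g. exported from a query over the Hudi metadata columns.
    """
    files_by_key = defaultdict(set)
    for record_key, file_name in rows:
        files_by_key[record_key].add(file_name)
    # a key mapped to 2+ distinct files is a cross-file duplicate
    return {k: sorted(v) for k, v in files_by_key.items() if len(v) > 1}

# Hypothetical sample mirroring the table above: u1 spans two files, u2 does not.
rows = [
    ("user_id:u1,customer_id:c1", "4a17e6ec-0_95_20220316201026.parquet"),
    ("user_id:u1,customer_id:c1", "52631482-0_81_20220315225025.parquet"),
    ("user_id:u2,customer_id:c1", "4a17e6ec-0_95_20220316201026.parquet"),
]

print(cross_file_duplicates(rows))
```

Only the keys returned by this check need investigating; a healthy table should yield an empty dict.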