mandar-mw opened a new issue, #5442:
URL: https://github.com/apache/hudi/issues/5442

   Hudi does not seem to deduplicate records in some cases. Below is the 
configuration we use. We partition the data by customer_id and our record key 
is [user_id, customer_id], so our expectation is that Hudi will enforce 
uniqueness within each partition, i.e. each customer_id folder. However, we are 
noticing that there are two parquet files inside some customer_id folders, and 
when we query the data in these partitions we see duplicate user_id values 
within the same customer_id. The _hoodie_record_key is identical for the two 
duplicate records, but the _hoodie_file_name is different, which makes me 
suspect that Hudi is enforcing uniqueness per individual parquet file rather 
than per customer_id folder. Can someone explain this behavior?
   
   ```
   op: "INSERT"
   target-base-path: "s3_path"
   target-table: "some_table_name"

   source-ordering-field: "created_at"
   transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"

   filter-dupes: ""
   hoodie_conf:
     # source table base path
     hoodie.deltastreamer.source.dfs.root: "s3_path"

     # record key, partition path and key generator
     hoodie.datasource.write.recordkey.field: "user_id,customer_id"
     hoodie.datasource.write.partitionpath.field: "customer_id"
     hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"

     # hive sync properties
     hoodie.datasource.hive_sync.enable: true
     hoodie.datasource.hive_sync.table: "table_name"
     hoodie.datasource.hive_sync.database: "database_name"
     hoodie.datasource.hive_sync.partition_fields: "customer_id"
     hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
     hoodie.datasource.write.hive_style_partitioning: true

     # sql transformer
     hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at AS created_at FROM <SRC> a"

     # since there is no dt partition, the following default config has to be overridden
     hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
   ```
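
   With ComplexKeyGenerator and recordkey.field set to "user_id,customer_id", 
the record key is the field:value pairs joined by commas, which matches the 
_hoodie_record_key values shown in the example further down. A minimal 
illustrative sketch of that key shape (plain Python, not Hudi's actual 
implementation; the sample row values are made up):

   ```python
   def complex_record_key(row, key_fields):
       # Mimics the "field:value,field:value" shape of _hoodie_record_key
       # produced by org.apache.hudi.keygen.ComplexKeyGenerator (illustrative only).
       return ",".join(f"{f}:{row[f]}" for f in key_fields)

   # Hypothetical row with the same columns as our transformer output.
   row = {"user_id": "u-123", "customer_id": "c-456", "created_at": "2022-03-15"}
   print(complex_record_key(row, ["user_id", "customer_id"]))
   # -> user_id:u-123,customer_id:c-456
   ```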
   
   Here is an example of duplicate records
   
   ```
   _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | user_record_id | created_at | org
   -- | -- | -- | -- | -- | -- | -- | --
   20220316201026 | 20220316201026_95_35511 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 4a17e6ec-8f53-4a68-8878-6c8d6c4e2583-0_95-26-3087_20220316201026.parquet | <redacted> | 2020-03-24 05:03:53.016406+00 | <redacted>
   20220315225025 | 20220315225025_81_28979 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 52631482-5c9b-4f84-97c1-1e5ab232b1de-0_81-26-8091_20220315225025.parquet | <redacted> | 2022-03-15 15:32:29.325168 | <redacted>
   ```

