eshu opened a new issue, #5689:
URL: https://github.com/apache/hudi/issues/5689

   The dataset is written in two stages: an initial load from a snapshot 
(insert operations), followed by on-demand updates from Kafka (upserts).
   
   I checked the snapshot and it does not contain any duplicates, but some 
duplicates appear during the second stage. Each duplicate consists of 2 
records with the same key in the same partition, stored in different files. 
The first record comes from the snapshot load and the second one is upserted 
from Kafka. It looks like upserts do not overwrite data from the snapshot in 
some cases. The problem does not occur on small datasets; it appears only on 
a big one.
   
   The number of duplicates is not big: ~1000 for ~100000 upserted records.
   
   The write options are roughly:
   ```
   DataSourceWriteOptions.TABLE_TYPE -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
   DataSourceWriteOptions.PRECOMBINE_FIELD -> "internal_ts",
   FileSystemViewStorageConfig.INCREMENTAL_TIMELINE_SYNC_ENABLE -> false,
   DataSourceWriteOptions.HIVE_STYLE_PARTITIONING -> true,
   HoodieCompactionConfig.CLEANER_INCREMENTAL_MODE_ENABLE -> true,
   HoodieCompactionConfig.CLEANER_POLICY -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS,
   HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED -> 3,
   DataSourceWriteOptions.ASYNC_COMPACT_ENABLE -> false,
   HoodieCompactionConfig.INLINE_COMPACT -> true,
   HoodiePayloadConfig.EVENT_TIME_FIELD -> "internal_ts",
   HoodiePayloadConfig.ORDERING_FIELD -> "internal_ts",
   DataSourceWriteOptions.PAYLOAD_CLASS_NAME -> "org.apache.hudi.common.model.EventTimeAvroPayload"
   ```
   
   Data sample as CSV:
   ```
   _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,internal_ts,event_type
   20220524104142181,20220524104142181_2_5302363,202713158,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_7,202713158,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653177858000,1
   20220524104142181,20220524104142181_2_5301697,202720884,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_5,202720884,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653182904000,1
   20220524104142181,20220524104142181_2_5301713,202725262,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_4,202725262,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653185666000,1
   20220524104142181,20220524104142181_2_5301843,202732411,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_6,202732411,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653190648000,1
   20220524104142181,20220524104142181_2_5301968,202743505,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_3,202743505,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653198094000,1
   20220524104142181,20220524104142181_2_5302039,202761336,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_5_2,202761336,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653210043000,1
   20220524104142181,20220524104142181_7_5217271,202986883,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_29_13514,202986883,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350461000,1
   20220524104142181,20220524104142181_7_5217354,202987578,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_29_13380,202987578,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350881000,1
   20220524104142181,20220524104142181_7_5217375,202987648,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_29_13589,202987648,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350882000,1
   20220524104142181,20220524104142181_7_5217449,202988003,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_29_13221,202988003,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653351062000,1
   20220524104142181,20220524104142181_7_5217496,202988323,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
   20220524110134548,20220524110134548_29_13425,202988323,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653351242000,1
   ```
   In this sample, an event_type of 0 marks inserted records and 1 marks 
upserted ones. The field "internal_ts" is used as the ordering field.
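
   For illustration, the duplicate check described above can be sketched in 
plain Scala, without any Hudi or Spark API. The names `SampleRow`, 
`DuplicateCheck`, `duplicates` and `expectedLatest` are hypothetical and 
exist only in this sketch; the four rows are copied from the sample above. 
It groups records by (partition, key), reports groups with more than one 
record, and shows which record I would expect to survive if the upsert had 
merged them by "internal_ts". It does not explain why the duplicates occur.

```scala
// Hypothetical row type holding only the columns needed for the check.
final case class SampleRow(
  commitTime: String,
  key: String,
  partition: String,
  fileName: String,
  internalTs: Long,
  eventType: Int
)

object DuplicateCheck {
  // Four rows copied from the CSV sample (two keys, two records each).
  val rows: Seq[SampleRow] = Seq(
    SampleRow("20220524104142181", "202713158", "date=2021-05-22",
      "eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet", 0L, 0),
    SampleRow("20220524110134548", "202713158", "date=2021-05-22",
      "4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet", 1653177858000L, 1),
    SampleRow("20220524104142181", "202986883", "date=2021-05-24",
      "ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet", 0L, 0),
    SampleRow("20220524110134548", "202986883", "date=2021-05-24",
      "4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet", 1653350461000L, 1)
  )

  // Groups by (partition, key) and keeps only groups with more than one record.
  def duplicates(rs: Seq[SampleRow]): Map[(String, String), Seq[SampleRow]] =
    rs.groupBy(r => (r.partition, r.key)).filter { case (_, group) => group.size > 1 }

  // Assumption: a successful upsert keeps, per (partition, key), the record
  // with the highest ordering value ("internal_ts").
  def expectedLatest(rs: Seq[SampleRow]): Map[(String, String), SampleRow] =
    rs.groupBy(r => (r.partition, r.key)).map { case (k, group) =>
      k -> group.maxBy(_.internalTs)
    }

  def main(args: Array[String]): Unit = {
    duplicates(rows).foreach { case ((partition, key), group) =>
      val distinctFiles = group.map(_.fileName).distinct.size
      println(s"$partition key=$key records=${group.size} distinctFiles=$distinctFiles")
    }
  }
}
```

   On the real table the same grouping would have to run over 
`_hoodie_partition_path` and `_hoodie_record_key` for the whole dataset.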
   
   How could I solve this issue?

