eshu opened a new issue, #5689: URL: https://github.com/apache/hudi/issues/5689
The dataset is loaded in two stages: an initial upload from a snapshot (insert operations), followed by on-demand updates from Kafka (upserts). I checked the snapshot and it does not contain any duplicates, but duplicates appear during the second stage. Each duplicate consists of 2 records with the same key in the same partition, stored in different files. The first record is from the snapshot load; the second one is upserted from Kafka. It looks like upserts do not overwrite data from the snapshot in some cases. The problem does not occur on small datasets; it appears only on a big one. The number of duplicates is small: ~1000 for ~100000 upserted records.

Options roughly are:

```
DataSourceWriteOptions.TABLE_TYPE -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.PRECOMBINE_FIELD -> "internal_ts",
FileSystemViewStorageConfig.INCREMENTAL_TIMELINE_SYNC_ENABLE -> false,
DataSourceWriteOptions.HIVE_STYLE_PARTITIONING -> true,
HoodieCompactionConfig.CLEANER_INCREMENTAL_MODE_ENABLE -> true,
HoodieCompactionConfig.CLEANER_POLICY -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS,
HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED -> 3,
DataSourceWriteOptions.ASYNC_COMPACT_ENABLE -> false,
HoodieCompactionConfig.INLINE_COMPACT -> true,
HoodiePayloadConfig.EVENT_TIME_FIELD -> "internal_ts",
HoodiePayloadConfig.ORDERING_FIELD -> "internal_ts",
DataSourceWriteOptions.PAYLOAD_CLASS_NAME -> "org.apache.hudi.common.model.EventTimeAvroPayload"
```

Data sample as CSV:

```
_hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,internal_ts,event_type
20220524104142181,20220524104142181_2_5302363,202713158,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_7,202713158,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653177858000,1
20220524104142181,20220524104142181_2_5301697,202720884,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_5,202720884,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653182904000,1
20220524104142181,20220524104142181_2_5301713,202725262,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_4,202725262,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653185666000,1
20220524104142181,20220524104142181_2_5301843,202732411,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_6,202732411,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653190648000,1
20220524104142181,20220524104142181_2_5301968,202743505,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_3,202743505,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653198094000,1
20220524104142181,20220524104142181_2_5302039,202761336,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_5_2,202761336,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653210043000,1
20220524104142181,20220524104142181_7_5217271,202986883,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_29_13514,202986883,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350461000,1
20220524104142181,20220524104142181_7_5217354,202987578,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_29_13380,202987578,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350881000,1
20220524104142181,20220524104142181_7_5217375,202987648,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_29_13589,202987648,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653350882000,1
20220524104142181,20220524104142181_7_5217449,202988003,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_29_13221,202988003,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653351062000,1
20220524104142181,20220524104142181_7_5217496,202988323,date=2021-05-24,ecca9f33-8691-4d85-b56b-5e5bfcf7d6a9-0_7-4470-297537_20220524104142181.parquet,0,0
20220524110134548,20220524110134548_29_13425,202988323,date=2021-05-24,4a5d1ec9-f67f-4db2-aa2d-3c169f35450c-0_29-52-1335_20220524110134548.parquet,1653351242000,1
```

In this sample, an event_type of 0 marks records inserted from the snapshot and 1 marks records upserted from Kafka. The field "internal_ts" is used as the ordering field.

How could I solve this issue?
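For reference, the duplicate check described above can be sketched roughly as follows. This is a minimal, Spark-free illustration, not the exact check that was run: `DuplicateCheck`, `Row`, `fileId`, `parse`, and `duplicates` are hypothetical names, only the first two sample rows are used, and the code assumes that the file ID is the part of `_hoodie_file_name` before the first underscore.

```scala
// Hypothetical helper, not part of Hudi: groups rows by (record key, partition)
// and reports the keys that are stored in more than one file.
object DuplicateCheck {
  // Subset of the Hudi metadata columns that appear in the CSV sample.
  final case class Row(recordKey: String, partitionPath: String, fileName: String)

  // Assumption: file names look like <fileId>_<writeToken>_<instantTime>.parquet,
  // so the file ID is everything before the first underscore.
  def fileId(fileName: String): String = fileName.takeWhile(_ != '_')

  // Parse one CSV line laid out like the sample header:
  // columns 2, 3 and 4 are record key, partition path and file name.
  def parse(line: String): Row = {
    val cols = line.split(",", -1)
    Row(cols(2), cols(3), cols(4))
  }

  // Map each duplicated (key, partition) pair to the distinct file IDs that hold it.
  def duplicates(rows: Seq[Row]): Map[(String, String), Set[String]] =
    rows
      .groupBy(r => (r.recordKey, r.partitionPath))
      .collect { case (k, rs) if rs.size > 1 => k -> rs.map(r => fileId(r.fileName)).toSet }

  def main(args: Array[String]): Unit = {
    // First two rows of the data sample.
    val sample = Seq(
      "20220524104142181,20220524104142181_2_5302363,202713158,date=2021-05-22,eafb523b-6e06-467f-aa73-59f2a6f1b0a7-0_2-4468-297519_20220524104142181.parquet,0,0",
      "20220524110134548,20220524110134548_5_7,202713158,date=2021-05-22,4c6c650e-ed42-4ce2-a663-5f84ed919bd4-0_5-46-1312_20220524110134548.parquet,1653177858000,1"
    ).map(parse)

    duplicates(sample).foreach { case ((key, partition), ids) =>
      println(s"$key in $partition -> ${ids.mkString(", ")}")
    }
  }
}
```

For every duplicated key in the sample, the two copies carry different file IDs. That would be consistent with the upserted record being written as a new insert instead of being routed to the file that already holds the snapshot record.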
