SabyasachiDasTR commented on issue #4311:
URL: https://github.com/apache/hudi/issues/4311#issuecomment-1019997016


   Hi @nsivabalan and @prashantwason,
   
   Adding to the original issue reported by @JohnEngelhart and @sunilknataraj.
   I am from the same organization and am following up on the same issue.
   
   Regarding the possible reasons for duplicates:
   1. We are not using bulk_insert, only upsert. [PFA upsert query]
   2. No multi-writer setup is involved.
   3. hoodie.combine.before.upsert is left at its default (true).
   
   Recent findings:
   We could reproduce the issue on a new dataset with the Hudi configuration below when upserting to a new table [PFA]:
   "hoodie.index.type" -> "SIMPLE",
   "hoodie.metadata.enable" -> "true",
   
   The combinations below do not produce duplicates when upserting to a new table:
   Index        Metadata
   SIMPLE       FALSE
   BLOOM        FALSE
   BLOOM        TRUE
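   
   For reference, the four tested combinations can be written out as option maps. This is only a sketch: `indexOptions` is a hypothetical helper, not part of Hudi, and it covers only the two keys named in this comment. The remaining settings are assumed to come from the attached hudiOptions.

```scala
// Hypothetical helper: builds only the two options discussed in this comment.
def indexOptions(indexType: String, metadataEnabled: Boolean): Map[String, String] =
  Map(
    "hoodie.index.type" -> indexType,
    "hoodie.metadata.enable" -> metadataEnabled.toString
  )

// Combination that reproduced duplicates when upserting to a new table.
val reproducing = indexOptions("SIMPLE", metadataEnabled = true)

// Combinations that did not produce duplicates when upserting to a new table.
val clean = Seq(
  indexOptions("SIMPLE", metadataEnabled = false),
  indexOptions("BLOOM", metadataEnabled = false),
  indexOptions("BLOOM", metadataEnabled = true)
)
```

   In practice, one of these maps would be merged over the rest of the writer options (for example with `++`) before the upsert.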
   
   
   However, when we reuse the table that originally had duplicates and change the index and metadata configuration,
   these combinations still produce duplicates in the latest consumed data.
   Our datasets range from small to large and receive a high volume of incremental updates.
   As suggested, we set index = BLOOM and metadata = false.
   We observed no duplicates for a new table with a fresh dataset, but duplicates are still created on the table for which the issue was reported.
   Compaction is working as expected with the inline type, but we see a lot of log files generated in the partitioned table alongside the Parquet data files.
   
   Deleting the existing table and re-ingesting all of the data may be an option to evaluate, but it is costly for us.
   Please suggest whether there is any way to remove the existing duplicates and to avoid ingesting new duplicates into the existing table.
   
   [PFA]
   1. .hoodie files attached
   2. hudiOptions used.
   3. Upsert query.
   
[hudiOptions.txt](https://github.com/apache/hudi/files/7925096/hudiOptions.txt)
   
[upsertQuery.txt](https://github.com/apache/hudi/files/7925097/upsertQuery.txt)
   
   
[hoodie_folder_SIMPLE_META_Enabled.zip](https://github.com/apache/hudi/files/7925088/hoodie_folder_SIMPLE_META_Enabled.zip)
   
   



