[ 
https://issues.apache.org/jira/browse/HUDI-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-796:
----------------------------------
    Status: Open  (was: New)

> Rewrite DedupeSparkJob.scala without considering the _hoodie_commit_time
> ------------------------------------------------------------------------
>
>                 Key: HUDI-796
>                 URL: https://issues.apache.org/jira/browse/HUDI-796
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>
> `_hoodie_commit_time` can be used to dedupe a partition path only if the 
> duplicates were caused by an INSERT operation. In the case of updates, the 
> bloom filter tags both files where the record is present for update, so all 
> such files carry the same `_hoodie_commit_time` for the duplicate record 
> from then on. 
> Hence it makes sense to rewrite this class without considering this metadata 
> field. 
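
The reasoning above can be illustrated with a minimal sketch. It is not the DedupeSparkJob code: the tuple layout, the file ids, the commit time values, and both helper functions are hypothetical, and "keep the smallest file id" is just one arbitrary rule that does not depend on the commit time. The sketch assumes a duplicate key that lives in two files and was later updated, so both copies carry the same commit time.

```python
from collections import defaultdict

# Hypothetical rows: (record_key, file_id, commit_time). All values are made up.
rows = [
    ("k1", "file-A", "20200401"),
    ("k1", "file-B", "20200401"),  # duplicate of k1; same commit time after an update
    ("k2", "file-A", "20200330"),
]


def dedupe_by_commit_time(rows):
    """Keep only the rows that carry the latest commit time for their key."""
    latest = {}
    for key, _, commit_time in rows:
        latest[key] = max(latest.get(key, commit_time), commit_time)
    return [r for r in rows if r[2] == latest[r[0]]]


def dedupe_by_file(rows):
    """Keep one row per key, chosen by smallest file id (ignores commit time)."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    return sorted(min(group, key=lambda r: r[1]) for group in by_key.values())


# The commit times tie, so the commit-time rule cannot drop either copy of k1.
commit_time_result = dedupe_by_commit_time(rows)

# A rule that ignores the commit time leaves exactly one row per key.
file_result = dedupe_by_file(rows)
```

Under these assumptions, the commit-time rule returns all three rows, while the file-based rule keeps a single row for `k1`.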



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
