[ 
https://issues.apache.org/jira/browse/HUDI-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4738:
----------------------------------
    Component/s: writer-core

> [MOR] Bloom Index missing new records inserted into Log files
> -------------------------------------------------------------
>
>                 Key: HUDI-4738
>                 URL: https://issues.apache.org/jira/browse/HUDI-4738
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.13.0
>
>
> Currently, the Bloom Index is implemented under the assumption that a 
> _file-group, once written, has a fixed set of records that cannot be 
> expanded_ (this is encoded through the assumption that at least one version 
> of every record within the file group is stored within its base file).
> This is relied upon when we tag incoming records with the locations of the 
> file-groups they could potentially belong to (in the case where such records 
> are updates), by fetching the Bloom Index info from either a) the base file or 
> b) the record in the Metadata Table (MT) Bloom Index associated with the 
> particular file-group id.
>  
> However, this assumption does not always hold: it is possible for _new_ 
> records to be inserted into the log-files, which means that the record 
> key-set of a single file-group can expand. Since tagging does not look at 
> log-files, a later write of such a key could be treated as a new insert, so 
> records that were previously written to log-files could be duplicated.
>  
> We need to reconcile these two aspects by doing one of the following:
>  # Disallow expansion of the file-group's record set (by not allowing inserts 
> into log-files)
>  # Fix the Bloom Index implementation to also check log-files during tagging.
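
The failure mode and the second option above can be illustrated with a minimal toy model. This is only a sketch under stated assumptions, not Hudi code: `FileGroup`, `tag_records` and `check_logs` are hypothetical names, and the Bloom filter is replaced by an exact key set so the result is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class FileGroup:
    """Hypothetical stand-in for a MOR file group (not a Hudi class)."""
    file_id: str
    base_keys: Set[str] = field(default_factory=set)  # keys in the base file
    log_keys: Set[str] = field(default_factory=set)   # keys present only in log files

    def bloom_candidates(self) -> Set[str]:
        # Assumption: the index only knows keys from the base file.
        # An exact set is used here instead of a real Bloom filter.
        return set(self.base_keys)


def tag_records(
    keys: Iterable[str],
    file_groups: List[FileGroup],
    check_logs: bool = False,
) -> Dict[str, Optional[str]]:
    """Map each key to the file group that holds it, or None (treated as insert)."""
    tagged: Dict[str, Optional[str]] = {}
    for key in keys:
        tagged[key] = None
        for fg in file_groups:
            known = fg.bloom_candidates()
            if check_logs:
                # Option 2: also consult keys that live only in log files.
                known |= fg.log_keys
            if key in known:
                tagged[key] = fg.file_id
                break
    return tagged


fg = FileGroup("fg-1", base_keys={"k1", "k2"})
fg.log_keys.add("k3")  # a new record inserted directly into a log file

before = tag_records(["k1", "k3"], [fg])
after = tag_records(["k1", "k3"], [fg], check_logs=True)
print(before)  # {'k1': 'fg-1', 'k3': None}
print(after)   # {'k1': 'fg-1', 'k3': 'fg-1'}
```

In the first call `k3` is not tagged, so a writer would treat it as a fresh insert and could write a duplicate; with `check_logs=True` it is routed back to `fg-1` as an update. The first option would instead forbid the `fg.log_keys.add("k3")` step altogether.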



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
