Alexey Kudinkin created HUDI-4738:
-------------------------------------

             Summary: [MOR] Bloom Index missing new records inserted into Log files
                 Key: HUDI-4738
                 URL: https://issues.apache.org/jira/browse/HUDI-4738
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
             Fix For: 0.13.0


Currently, the Bloom Index is implemented under the assumption that a 
_file-group, once written, has a fixed set of records that cannot be 
expanded_ (this is encoded through the assumption that at least one version of 
every record within the file group is stored within its base file).

This assumption is relied upon when we tag incoming records with the locations 
of the file-groups they could potentially belong to (in case such records are 
updates), by fetching the Bloom Index info from either a) the base file or b) 
the record in the Metadata Table (MT) Bloom Index associated with the 
particular file-group id.

 

However, this assumption does not always hold: it is possible for _new_ 
records to be inserted into the log-files, which means that the record 
key-set of a single file-group can expand. Since the index does not see keys 
that exist only in log-files, a later write of such a key would not be tagged 
as an update, so records previously written to log-files could be duplicated.
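
To illustrate the failure mode, here is a minimal toy sketch in Python. It is not Hudi code: `FileGroup` and `tag_location` are hypothetical names, and a plain set stands in for the Bloom filter (a real filter could also return false positives, which does not change the point being shown).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileGroup:
    """Toy stand-in for a MOR file group (hypothetical, not a Hudi class)."""
    file_id: str
    base_keys: set = field(default_factory=set)  # keys present in the base file
    log_keys: set = field(default_factory=set)   # keys present only in log files

    def base_file_filter(self) -> set:
        # Assumption: a plain set approximates the Bloom filter built from
        # the base file only; log-file keys are never part of it.
        return set(self.base_keys)


def tag_location(key: str, file_groups: list) -> Optional[str]:
    """Return the file id whose base-file filter contains the key, else None.

    None means the record is treated as a brand-new insert.
    """
    for fg in file_groups:
        if key in fg.base_file_filter():
            return fg.file_id
    return None


fg = FileGroup("fg-1", base_keys={"k1", "k2"})
fg.log_keys.add("k3")  # a new record inserted directly into a log file

update_location = tag_location("k1", [fg])  # key is in the base file
missed_location = tag_location("k3", [fg])  # key lives only in the log file
```

In this sketch `k3` already exists in the file group, yet tagging returns no location for it, so a subsequent write of `k3` would be routed as an insert and the record would end up duplicated.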

 

We need to reconcile these two aspects by doing one of the following:
 # Disallow expansion of the file-group's record set (by not allowing inserts 
into log-files)
 # Fix the Bloom Index implementation to also check log-files during tagging.
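
The two options above can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the actual fix: `append_to_log`, `tag_location`, and `InsertIntoLogError` are hypothetical names, and each file group is modeled as a pair of key sets (base-file keys, log-file keys).

```python
from typing import Dict, Optional, Set, Tuple

# Assumption: a file group is modeled as (base_keys, log_keys).
FileGroups = Dict[str, Tuple[Set[str], Set[str]]]


class InsertIntoLogError(Exception):
    """Raised when a key unknown to the base file is appended to a log file."""


def append_to_log(base_keys: Set[str], log_keys: Set[str], key: str) -> None:
    """Option 1: accept only updates, so the file group's key set never grows."""
    if key not in base_keys:
        raise InsertIntoLogError(f"{key} is not an update to an existing record")
    log_keys.add(key)


def tag_location(key: str, file_groups: FileGroups) -> Optional[str]:
    """Option 2: consult log-file keys as well as base-file keys when tagging."""
    for file_id, (base_keys, log_keys) in file_groups.items():
        if key in base_keys or key in log_keys:
            return file_id
    return None


groups: FileGroups = {"fg-1": ({"k1", "k2"}, {"k3"})}
found_in_log = tag_location("k3", groups)  # located via the log-file keys
not_found = tag_location("k9", groups)     # genuinely new key
```

Option 1 keeps the existing index assumption valid at the cost of restricting what the writer may put into log-files; option 2 keeps inserts into log-files but makes tagging do extra work per file group.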



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
