hudi-bot opened a new issue, #15399:
URL: https://github.com/apache/hudi/issues/15399

   Currently, the Bloom Index is implemented under the assumption that a 
_file group, once written, has a fixed set of records that cannot be 
expanded_ (this is encoded through the assumption that at least one version 
of every record within the file group is stored in its base file).
   
   This assumption is relied upon when we tag incoming records with the 
locations of the file groups they could potentially belong to (in case such 
records are updates), by fetching the Bloom Index info from either a) the 
base file or b) the record in the metadata table (MT) Bloom Index associated 
with the particular file-group id.
   
    
   
   However, this assumption does not always hold: _new_ records can be 
inserted into the log files, which means that the record key set of a single 
file group can expand. Because tagging never consults the log files, records 
that were previously written only to log files can end up duplicated.
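
The failure mode can be illustrated with a toy model. This is a minimal sketch, not Hudi code: `FileGroup`, `index_keys`, and `tag` are hypothetical names, and a plain Python set stands in for the Bloom filter (so there are no false positives). The only property it models is that the index is built from the base file alone.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy file group: tracks record keys only, no payloads."""
    file_id: str
    base_keys: set = field(default_factory=set)  # keys stored in the base file
    log_keys: set = field(default_factory=set)   # keys appended to log files

    def index_keys(self):
        # Stand-in for the Bloom filter: built from the base file only.
        return set(self.base_keys)


def tag(key, file_groups):
    """Return the file_id the key is tagged to, or None (treated as an insert)."""
    for fg in file_groups:
        if key in fg.index_keys():
            return fg.file_id
    return None


fg1 = FileGroup("fg-1", base_keys={"k1", "k2"})
fg1.log_keys.add("k3")  # a new record inserted straight into a log file

first = tag("k1", [fg1])   # update of a base-file key: located
second = tag("k3", [fg1])  # update of the log-only key: not located

# The untagged record is written as a fresh insert, here into a new file group.
fg2 = FileGroup("fg-2", base_keys={"k3"})
copies_of_k3 = sum("k3" in (fg.base_keys | fg.log_keys) for fg in (fg1, fg2))
```

In this model the update to `k3` is not tagged, so it is written again and the key ends up in both file groups.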
   
    
   
   We need to reconcile these two aspects and do one of the following:
    1. Disallow expansion of the file group's record set (by not allowing 
inserts into log files).
    2. Fix the Bloom Index implementation to also check log files during 
tagging.
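
The two options can be sketched side by side in a simplified model. The names `tag` and `write_insert`, the `check_logs` and `allow_log_inserts` flags, and the choice to route a disallowed insert to a brand-new file group are all assumptions made for illustration; none of them are Hudi APIs or configs. Sets again stand in for Bloom filters.

```python
def tag(key, file_groups, check_logs):
    """file_groups maps file_id -> (base_keys, log_keys), both sets.

    With check_logs=True, log-file keys are consulted as well (option 2).
    """
    for file_id, (base_keys, log_keys) in file_groups.items():
        known = base_keys | log_keys if check_logs else base_keys
        if key in known:
            return file_id
    return None


def write_insert(key, file_groups, target_id, allow_log_inserts, new_file_id):
    """Write a new key; with allow_log_inserts=False it never lands in a log (option 1)."""
    if allow_log_inserts:
        file_groups[target_id][1].add(key)      # key goes to the target's log files
        return target_id
    file_groups[new_file_id] = ({key}, set())   # key goes to a new base file
    return new_file_id


# Option 2: a log-only key becomes visible once tagging also checks log files.
groups = {"fg-1": ({"k1"}, {"k3"})}
without_logs = tag("k3", groups, check_logs=False)
with_logs = tag("k3", groups, check_logs=True)

# Option 1: inserts bypass log files, so base-file-only tagging stays correct.
groups2 = {"fg-1": ({"k1"}, set())}
written_to = write_insert("k9", groups2, "fg-1", False, "fg-2")
located_in = tag("k9", groups2, check_logs=False)
```

Option 1 keeps tagging cheap at the cost of restricting where inserts can go; option 2 keeps the write path as is but makes tagging read more data.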
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-4738
   - Type: Bug
   - Fix version(s):
     - 1.1.0

