Alexey Kudinkin created HUDI-4738:
-------------------------------------
Summary: [MOR] Bloom Index missing new records inserted into Log files
Key: HUDI-4738
URL: https://issues.apache.org/jira/browse/HUDI-4738
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Fix For: 0.13.0
Currently, Bloom Index is implemented under the assumption that a
_file-group, once written, has a fixed set of records that cannot be
expanded_ (this is encoded through the assumption that at least one version of
every record w/in the file-group is stored w/in its base file).
This is relied upon when we tag incoming records w/ the locations of the
file-groups they could potentially belong to (in case such records are
updates), by fetching the Bloom Index info from either a) the base file or b)
the record in the Metadata Table (MT) Bloom Index associated w/ the particular
file-group id.
However, this assumption does not always hold: it's possible for _new_
records to be inserted into the log-files, which means that the record
key-set of a single file-group can expand. Since tagging does not see such
keys, records that were previously written to log-files could end up
duplicated.
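A toy model of the failure described above, not Hudi code: a plain Python set stands in for the Bloom filter plus key lookup, and `FileGroup`, `tag`, and `write` are hypothetical names used only for illustration. It assumes the second write of the same key happens to be routed to a different file-group.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy file-group: keys in the base file plus keys present only in log files."""
    file_id: str
    base_keys: set = field(default_factory=set)
    log_keys: set = field(default_factory=set)


def tag(key, file_groups):
    """Tag a key as an update only if some base file claims it (the current assumption)."""
    for fg in file_groups:
        if key in fg.base_keys:  # a set stands in for the Bloom filter + key lookup
            return fg.file_id
    return None  # treated as a brand-new insert


def write(key, file_groups, insert_target):
    """Route the record: updates go to the tagged group, inserts to insert_target's log."""
    file_id = tag(key, file_groups)
    target = next((fg for fg in file_groups if fg.file_id == file_id), insert_target)
    target.log_keys.add(key)
    return target.file_id


fg1 = FileGroup("fg-1", base_keys={"k1"})
fg2 = FileGroup("fg-2", base_keys={"k2"})
groups = [fg1, fg2]

write("k3", groups, insert_target=fg1)  # new record lands in fg-1's log
write("k3", groups, insert_target=fg2)  # same key again: not tagged, lands in fg-2

holders = [fg.file_id for fg in groups if "k3" in fg.base_keys | fg.log_keys]
print(holders)  # ['fg-1', 'fg-2']
```

In this sketch the key `k3` ends up in two file-groups because tagging only ever consults base-file keys.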
We need to reconcile these 2 aspects and do one of the following:
# Disallow expansion of the file-group's record set (by not allowing inserts
into log-files)
# Fix the Bloom Index implementation to also check log-files during tagging.
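As a rough illustration of the second option, here is a minimal sketch in which tagging consults log-file keys as well as base-file keys. It is not Hudi's API: `tag_with_logs` and the in-memory `file_groups` mapping are hypothetical, and plain sets stand in for whatever Bloom filter or key index would actually back the lookup.

```python
def tag_with_logs(key, file_groups):
    """Return the id of the first file-group whose base file or log files hold the key."""
    for file_id, (base_keys, log_keys) in file_groups.items():
        if key in base_keys or key in log_keys:
            return file_id
    return None  # no file-group holds the key: treat it as a new insert


# file-group id -> (keys in base file, keys present only in log files)
file_groups = {
    "fg-1": ({"k1"}, {"k3"}),
    "fg-2": ({"k2"}, set()),
}

print(tag_with_logs("k3", file_groups))  # fg-1
print(tag_with_logs("k9", file_groups))  # None
```

With log-file keys included, a record first inserted into a log file is tagged as an update to its existing file-group rather than treated as a new insert.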