hudi-bot opened a new issue, #15399:
URL: https://github.com/apache/hudi/issues/15399
Currently, the Bloom Index is implemented under the assumption that
_a file group, once written, has a fixed set of records that cannot be
expanded_. This is encoded through the assumption that at least one version
of every record within the file group is stored in its base file.
We rely on this when tagging incoming records with the locations of the
file groups they could belong to (in case such records are updates): the
Bloom Index info is fetched from either a) the base file or b) the record in
the metadata table (MT) Bloom Index associated with the particular file-group id.
However, this assumption does not always hold: _new_ records can be
inserted into log files, which means that the record key set of a single
file group can expand. Since tagging never consults log files, a later write
of such a record's key would be treated as an insert, so records that were
previously written to log files could be duplicated.
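
The failure mode can be illustrated with a toy model. This is a minimal sketch, not Hudi code: `FileGroup` and `tag_with_base_only` are hypothetical names, and a plain Python set stands in for the Bloom filter (so it has no false positives, unlike a real one).

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy file group (hypothetical, not a Hudi class)."""
    file_id: str
    base_keys: set = field(default_factory=set)  # keys covered by the base file's filter
    log_keys: set = field(default_factory=set)   # keys that exist only in log files


def tag_with_base_only(incoming_keys, file_groups):
    """Tag each key with a file group id by consulting base-file keys only.

    Keys with no match map to None, i.e. they are treated as new inserts.
    """
    tagged = {}
    for key in incoming_keys:
        tagged[key] = next(
            (fg.file_id for fg in file_groups if key in fg.base_keys), None
        )
    return tagged


fg = FileGroup("fg-1", base_keys={"k1", "k2"})
fg.log_keys.add("k3")  # a new record inserted straight into a log file

tagged = tag_with_base_only(["k1", "k3"], [fg])
print(tagged)  # {'k1': 'fg-1', 'k3': None}
```

In this sketch `k3` already lives in the log files of `fg-1`, yet it is tagged as an insert, so writing it again would produce a second copy of the record.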
We need to reconcile these two aspects and do one of the following:
1. Disallow expansion of the file group's record set (by not allowing
inserts into log files).
2. Fix the Bloom Index implementation to also check log files during tagging.
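
The second option could look roughly like the toy sketch below. It is not the Hudi implementation: `candidate_file_group` and the dictionary fields `id`, `base_keys`, and `log_keys` are hypothetical, and sets stand in for both the Bloom filter lookup and a scan of log-file keys.

```python
def candidate_file_group(key, file_groups):
    """Return the id of the first file group whose base file or log files hold key.

    Each file group is a dict with the hypothetical fields "id", "base_keys"
    and "log_keys". Returns None when no file group holds the key, meaning
    the record is a true insert.
    """
    for fg in file_groups:
        if key in fg["base_keys"] or key in fg["log_keys"]:
            return fg["id"]
    return None


file_groups = [{"id": "fg-1", "base_keys": {"k1", "k2"}, "log_keys": {"k3"}}]

print(candidate_file_group("k3", file_groups))  # fg-1
print(candidate_file_group("k9", file_groups))  # None
```

With log-file keys included in the lookup, a record that was first inserted into a log file is tagged as an update to its existing file group rather than as a new insert.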
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-4738
- Type: Bug
- Fix version(s):
  - 1.1.0