[
https://issues.apache.org/jira/browse/HUDI-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-4738:
----------------------------------
Component/s: writer-core
> [MOR] Bloom Index missing new records inserted into Log files
> -------------------------------------------------------------
>
> Key: HUDI-4738
> URL: https://issues.apache.org/jira/browse/HUDI-4738
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, the Bloom Index is implemented under the assumption that a
> _file-group, once written, has a fixed set of records that cannot be
> expanded_ (this is encoded through the assumption that at least one version
> of every record within the file-group is stored within its base file).
> This is relied upon when we tag incoming records with the locations of the
> file-groups they could potentially belong to (in case such records are
> updates), by fetching the Bloom Index info from either a) the base file or
> b) the record in the Metadata Table (MT) Bloom Index associated with the
> particular file-group id.
>
> However, this assumption does not always hold: it is possible for _new_
> records to be inserted into the log-files, which means that the record
> key-set of a single file-group can expand. As a result, some records that
> were previously written to log-files could be duplicated.
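
To make the failure mode concrete, here is a minimal sketch of it. It is not Hudi code: `FileGroup`, `base_file_filter`, and `tag_location` are hypothetical names, and a plain Python set stands in for the Bloom filter (so there are no false positives). The only point it illustrates is that a filter built from base-file keys cannot see a key that exists only in a log file.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Hypothetical model of a file-group: one base file plus log files."""
    file_id: str
    base_keys: set = field(default_factory=set)  # keys stored in the base file
    log_keys: set = field(default_factory=set)   # keys stored only in log files

    def base_file_filter(self) -> set:
        # Stand-in for the Bloom filter: built from base-file keys only.
        return set(self.base_keys)


def tag_location(key, file_groups):
    """Return the id of the file-group whose base-file filter contains `key`,
    or None, in which case the record is treated as a brand-new insert."""
    for fg in file_groups:
        if key in fg.base_file_filter():
            return fg.file_id
    return None


fg = FileGroup("fg-1", base_keys={"k1", "k2"})
fg.log_keys.add("k3")  # a new record inserted directly into a log file

tagged_k1 = tag_location("k1", [fg])  # found via the base file: an update
tagged_k3 = tag_location("k3", [fg])  # not found: treated as an insert
```

In this sketch a later write of `k3` is not recognized as an update, so it would be written again as a new record, which is the duplication described above.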
>
> We need to reconcile these 2 aspects and do one of the following:
> # Disallow expansion of the file-group's record set (by not allowing inserts
> into log-files)
> # Fix the Bloom Index implementation to also check log-files during tagging.
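
The second option can be sketched as follows. This is an illustration under stated assumptions, not a proposed patch: `tag_location_with_logs` and the two dictionaries are hypothetical, plain sets stand in for the Bloom filters, and how log-file keys would actually be collected or indexed is left open.

```python
def tag_location_with_logs(key, base_keys_by_group, log_keys_by_group):
    """Hypothetical tagging step that consults both the base-file key set
    and the log-file key set of every file-group.

    Returns the matching file-group id, or None if the key is new."""
    for file_id, base_keys in base_keys_by_group.items():
        log_keys = log_keys_by_group.get(file_id, set())
        if key in base_keys or key in log_keys:
            return file_id
    return None


base_keys_by_group = {"fg-1": {"k1", "k2"}}
log_keys_by_group = {"fg-1": {"k3"}}  # k3 was inserted into a log file

update_target = tag_location_with_logs("k3", base_keys_by_group, log_keys_by_group)
insert_target = tag_location_with_logs("k9", base_keys_by_group, log_keys_by_group)
```

Here `k3` is routed back to `fg-1` as an update, while `k9`, which appears nowhere, is still treated as an insert. The first option avoids the extra lookup entirely by guaranteeing that the log-file key set is always a subset of the base-file key set.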