yihua opened a new pull request, #7379:
URL: https://github.com/apache/hudi/pull/7379
### Change Logs
Before this PR, only the "SIMPLE" and "GLOBAL_SIMPLE" index types were
supported for virtual keys.
This PR adds support for virtual keys in the non-global and global Bloom
Index. It fixes two major issues:
- When looking up record keys in a file group, the `HoodieKeyLookupHandle`
first checks each record key against the bloom filter to build a set of
candidates, then fetches all the record keys from the parquet file to filter
the candidates and produce the final set of record keys.
  - When virtual keys are disabled (`hoodie.populate.meta.fields=true`), the
record keys can be fetched by directly reading the `_hoodie_record_key` meta
field from the parquet file.
  - When virtual keys are enabled (`hoodie.populate.meta.fields=false`), the
record keys need to be generated on the fly by the key generator, using
the record key fields in the data schema specified by the user. This is the
new logic added in this PR.
- Before this PR, when virtual keys were enabled
(`hoodie.populate.meta.fields=false`), bloom filters were not written to the
footers of the parquet files. This PR fixes that behavior so that bloom filters
are always written to the parquet footers, regardless of whether virtual
keys are enabled.
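The two-stage lookup described in the first item can be sketched as follows. This is a simplified illustration in Python, not the code in `HoodieKeyLookupHandle`: `ToyBloomFilter`, `key_from_meta_field`, `make_virtual_key_extractor`, and `lookup_keys` are hypothetical names, rows are plain dicts standing in for parquet records, and the key format produced for virtual keys is an assumption rather than the output of any particular Hudi key generator.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, object]


class ToyBloomFilter:
    """Tiny bloom filter for illustration only (hypothetical, not Hudi's)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def key_from_meta_field(row: Row) -> str:
    """Virtual keys disabled: the key is stored in the meta field."""
    return str(row["_hoodie_record_key"])


def make_virtual_key_extractor(record_key_fields: Sequence[str]) -> Callable[[Row], str]:
    """Virtual keys enabled: rebuild the key from the data columns.

    Joining field values with ',' is an assumed format for this sketch.
    """
    def extract(row: Row) -> str:
        return ",".join(str(row[field]) for field in record_key_fields)
    return extract


def lookup_keys(
    incoming_keys: Iterable[str],
    bloom: ToyBloomFilter,
    file_rows: Iterable[Row],
    extract_key: Callable[[Row], str],
) -> List[str]:
    # Stage 1: the bloom filter prunes keys that are definitely not in the file.
    candidates = [k for k in incoming_keys if bloom.might_contain(k)]
    if not candidates:
        return []  # No need to read the file at all.
    # Stage 2: read the actual keys to eliminate bloom filter false positives.
    keys_in_file = {extract_key(row) for row in file_rows}
    return [k for k in candidates if k in keys_in_file]


# Usage with virtual keys: the file has no _hoodie_record_key column.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
extract = make_virtual_key_extractor(["id"])
bloom = ToyBloomFilter()
for row in rows:
    bloom.add(extract(row))
matched = lookup_keys(["1", "3", "2"], bloom, rows, extract)
```

Only the `extract_key` callable differs between the two modes, which is why the lookup itself does not need to know whether meta fields are populated.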
New unit tests are added to cover virtual key support in the Bloom
Index.
### Impact
With virtual keys enabled (`hoodie.populate.meta.fields=false`), the Bloom
Index can now be used, which can improve upsert performance when needed.
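As a rough illustration of the configuration this enables, the options below combine virtual keys with the Bloom Index for a Spark datasource write. The table name and the column names (`example_table`, `id`, `region`) are placeholders assumed for this sketch, and the commented-out write call assumes an existing DataFrame `df` and a target path.

```python
# Hypothetical write options; table and column names are placeholders.
hudi_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.operation": "upsert",
    # Enable virtual keys: meta fields such as _hoodie_record_key are not stored.
    "hoodie.populate.meta.fields": "false",
    # Use the non-global Bloom Index; GLOBAL_BLOOM selects the global variant.
    "hoodie.index.type": "BLOOM",
}

# With PySpark and an existing DataFrame `df`, the options would be passed as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```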
### Risk level
medium
### Documentation Update
Update the virtual keys documentation.
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed