yihua opened a new pull request, #7379:
URL: https://github.com/apache/hudi/pull/7379

   ### Change Logs
   
   Before this PR, only the "SIMPLE" and "GLOBAL_SIMPLE" index types were 
supported for virtual keys.
   
   This PR adds support for virtual keys in the non-global and global Bloom 
Index.  Two major issues are fixed:
   - When looking up record keys in a file group, the `HoodieKeyLookupHandle` 
first checks the record key against the bloom filter to construct a set of 
candidates and then fetches all the record keys from the parquet file to filter 
the candidates and generate the final set of record keys.
     - When virtual keys are disabled (`hoodie.populate.meta.fields=true`), the 
record keys can be fetched by directly reading the `_hoodie_record_key` meta 
field from the parquet file.
     - When virtual keys are enabled (`hoodie.populate.meta.fields=false`), the 
record keys need to be generated on the fly based on the key generator, using 
the record key fields in the data schema specified by the user.  This is the 
new logic added in this PR.
   - Before this PR, when virtual keys were enabled 
(`hoodie.populate.meta.fields=false`), the bloom filters were not written to the 
footers of the parquet files.  This PR fixes the behavior so that bloom filters 
are always written to the parquet footers, regardless of whether virtual keys 
are enabled.
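
The two fixes above can be sketched in a few lines of Python. This is an illustrative model only, not the code in this PR: `ToyBloomFilter`, `generate_key`, `record_keys_in_file`, `build_bloom_filter`, and `lookup` are hypothetical names, rows are plain dicts standing in for parquet records, and the key generator simply joins the configured record key fields.

```python
import hashlib
from typing import Dict, Iterable, List, Set

RECORD_KEY_META_FIELD = "_hoodie_record_key"


class ToyBloomFilter:
    """Tiny bloom filter for illustration; not Hudi's implementation."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str) -> List[int]:
        # Derive num_hashes bit positions by salting the key with an index.
        return [
            int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def generate_key(row: Dict[str, object], key_fields: List[str]) -> str:
    # Hypothetical key generator: joins the record key field values.
    return ":".join(str(row[field]) for field in key_fields)


def record_keys_in_file(rows: Iterable[Dict[str, object]],
                        populate_meta_fields: bool,
                        key_fields: List[str]) -> Set[str]:
    if populate_meta_fields:
        # Virtual keys disabled: read the stored meta field directly.
        return {str(row[RECORD_KEY_META_FIELD]) for row in rows}
    # Virtual keys enabled: regenerate keys from the data fields on the fly.
    return {generate_key(row, key_fields) for row in rows}


def build_bloom_filter(rows: List[Dict[str, object]],
                       populate_meta_fields: bool,
                       key_fields: List[str]) -> ToyBloomFilter:
    # The filter is built in both modes, mirroring "always written to the footer".
    bloom = ToyBloomFilter()
    for key in record_keys_in_file(rows, populate_meta_fields, key_fields):
        bloom.add(key)
    return bloom


def lookup(incoming_keys: List[str],
           bloom: ToyBloomFilter,
           rows: List[Dict[str, object]],
           populate_meta_fields: bool,
           key_fields: List[str]) -> List[str]:
    # Step 1: the bloom filter prunes keys that are definitely absent.
    candidates = [key for key in incoming_keys if bloom.might_contain(key)]
    if not candidates:
        return []
    # Step 2: the actual keys in the file remove bloom false positives.
    actual = record_keys_in_file(rows, populate_meta_fields, key_fields)
    return [key for key in candidates if key in actual]
```

With virtual keys enabled, the rows carry no `_hoodie_record_key` column, so both the filter construction and the final check rely on keys regenerated from the configured record key fields.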
   
   New unit tests are added to test the support for virtual keys in Bloom 
Index. 
   
   ### Impact
   
   With virtual keys enabled (`hoodie.populate.meta.fields=false`), Bloom 
Index is now supported, unlocking better upsert performance when needed.
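
As a rough illustration, a writer might combine virtual keys with the Bloom Index through options like the following. The table name and record key field are placeholders, and a real write typically needs further options (for example partition path and precombine fields) that are omitted from this sketch.

```python
# Illustrative write options; "example_table" and "id" are placeholder values.
hudi_options = {
    "hoodie.table.name": "example_table",
    # Data column(s) the key generator uses to build record keys on the fly.
    "hoodie.datasource.write.recordkey.field": "id",
    # Enable virtual keys: do not materialize the Hudi meta fields.
    "hoodie.populate.meta.fields": "false",
    # Use the non-global Bloom Index; "GLOBAL_BLOOM" selects the global variant.
    "hoodie.index.type": "BLOOM",
}

# Assumed usage with an existing Spark DataFrame `df` and target path `base_path`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```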
   
   ### Risk level
   
   medium
   
   ### Documentation Update
   
   Update docs of virtual keys
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
