nsivabalan edited a comment on pull request #1469:
URL: https://github.com/apache/hudi/pull/1469#issuecomment-653546494


   @lamber-ken @vinothchandar : I took a stab at the global bloom index V2. I 
don't have permissions on lamberken's repo, so I couldn't update his 
branch. Here are my 
[branch](https://github.com/nsivabalan/hudi/tree/bloomIndexV2) and 
[commit](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f)
 links, and 
[here](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f#diff-fa376d426f0652ffeb1e1f807795196e)
 is the link to GlobalBloomIndexV2. Please check it out. I have added and 
fixed tests for it. 
   
   Also, I have two questions/clarifications.
   1: With the regular bloom index V2, why do we need to sort by both 
partition path and record key? Wouldn't sorting by partition path alone suffice? 
   2: Correct me if I am wrong, but there is one corner case that both bloom 
index V2 and the global version need to handle. The fix might incur an 
additional left outer join, so I wanted to confirm whether this case is feasible. 
   Let's say that for an incoming record, one or more files are returned after 
the range and bloom lookup, but in the key checker none of those files turns 
out to contain the record key. In this scenario, the output of tag location may 
not contain the record at all. 
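   
   To make the scenario concrete, here is a toy model in plain Python. It is 
not the Hudi code: the function names, the file contents, and the simulated 
bloom false positive for `k9` are all made up for illustration, and it assumes 
the key checker only emits a record when a candidate file really contains the 
key. 

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical storage: file id -> record keys actually present in that file.
FILES: Dict[str, Set[str]] = {"f1": {"k1"}, "f2": {"k2"}}

# Hypothetical bloom false positive: the filter for f1 claims it may hold k9.
FALSE_POSITIVES: Dict[str, List[str]] = {"k9": ["f1"]}


def range_and_bloom_candidates(key: str) -> List[str]:
    """Stand-in for the range + bloom lookup: true hits plus false positives."""
    hits = [f for f, keys in FILES.items() if key in keys]
    return hits + FALSE_POSITIVES.get(key, [])


def key_check(key: str, candidates: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Stand-in for the key checker stage."""
    if not candidates:
        # No candidate files at all: pass the record through as untagged (new).
        return [(key, None)]
    # Only emit the record for candidate files that really contain the key.
    return [(key, f) for f in candidates if key in FILES[f]]


incoming = ["k1", "k9", "k5"]
tagged = [t for k in incoming for t in key_check(k, range_and_bloom_candidates(k))]
print(tagged)  # [('k1', 'f1'), ('k5', None)]
```

   In this model `k9` has a candidate file but no real match, so it never shows 
up in the tagged output, while `k5` (no candidates) and `k1` (a real match) do. 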
   
   If this is a feasible case, the fix I can think of is as follows. 
   Do not return empty candidates from LazyRangeAndBloomChecker, so that the 
result after LazyKeyChecker does not contain such records. With this fix, 
LazyKeyChecker returns only records that already exist in storage. Once we have 
the result from LazyKeyChecker, we would do a left outer join with the incoming 
records to find the non-existent records and add them to the final tagged 
record list. 
   
   A similar fix needs to be made in the global version as well. 
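   
   A minimal sketch of the left outer join step, in plain Python rather than 
Spark. The names `left_outer_join_tag`, `incoming`, and `found` are 
hypothetical, and `found` stands in for the LazyKeyChecker output once it only 
returns records that exist in storage. In the actual index this would 
presumably be a join on pair RDDs; the sketch only shows the intended result. 

```python
from typing import Dict, List, Optional, Tuple


def left_outer_join_tag(
    incoming: List[str], found: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Left outer join of incoming keys against keys found in storage.

    Every incoming key appears exactly once in the output. Keys missing
    from `found` get location None, i.e. they are treated as new records.
    """
    return [(key, found.get(key)) for key in incoming]


# Hypothetical inputs: k1 exists in file f1; k9 (bloom false positive) and
# k5 (no candidate files) were both dropped before the join.
incoming = ["k1", "k9", "k5"]
found = {"k1": "f1"}

result = left_outer_join_tag(incoming, found)
print(result)  # [('k1', 'f1'), ('k9', None), ('k5', None)]
```

   With the join in place, both kinds of non-existent records come back as 
untagged, regardless of whether the bloom lookup returned candidates for them. 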
   
   
   
   

