the-other-tim-brown commented on PR #10578: URL: https://github.com/apache/hudi/pull/10578#issuecomment-1913680799
Open questions: I see that the code currently loads the column ranges for all files in all of the affected partitions into a single object for the range filtering. Should we try to leverage Spark to limit the evaluation to a single partition, or to some cluster of files within that partition? Does it make sense to pull the bloom filter check into this step as well? Right now we'll read the footers twice, but if we can build a cluster of candidate files for each key to evaluate against, we could read the range and the bloom filter at the same time and do both evaluations then, only for the files that the key may be a part of.
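To make the suggestion concrete, here is a minimal sketch of the combined evaluation: each file's min/max key range and its bloom filter are checked together in one pass, so the footer only needs to be read once per file. All names here are hypothetical, and a plain `Set` stands in for a real bloom filter (which may false-positive but never false-negative).

```java
import java.util.*;

// Hypothetical per-file metadata read once from the footer:
// the key range plus the bloom filter.
class FileRange {
    final String fileId;
    final String minKey, maxKey;
    final Set<String> bloom; // stand-in for a real bloom filter

    FileRange(String fileId, String minKey, String maxKey, Set<String> bloom) {
        this.fileId = fileId;
        this.minKey = minKey;
        this.maxKey = maxKey;
        this.bloom = bloom;
    }

    // true iff key falls inside [minKey, maxKey] AND the bloom filter says "maybe"
    boolean mayContain(String key) {
        return key.compareTo(minKey) >= 0
            && key.compareTo(maxKey) <= 0
            && bloom.contains(key);
    }
}

public class CombinedPruning {
    // Cluster of files a key may belong to, after both checks.
    static List<String> candidateFiles(List<FileRange> files, String key) {
        List<String> out = new ArrayList<>();
        for (FileRange f : files) {
            if (f.mayContain(key)) {
                out.add(f.fileId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileRange> files = List.of(
            new FileRange("f1", "a", "m", Set.of("c", "k")),
            new FileRange("f2", "n", "z", Set.of("p")));
        System.out.println(candidateFiles(files, "k")); // [f1]
        System.out.println(candidateFiles(files, "p")); // [f2]
    }
}
```

The point of the sketch is only that the range and bloom checks can share one metadata read; in the real code path the bloom filter would come from the parquet footer (or the metadata table) rather than being materialized as a set.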
