the-other-tim-brown commented on PR #10578: URL: https://github.com/apache/hudi/pull/10578#issuecomment-1913680799
Open questions: I see that the code currently loads the column ranges for all files in all of the affected partitions into a single object for the range filtering. Should we try to leverage Spark to limit the evaluation to a single partition, or to some cluster of files within that partition? Does it make sense to pull the bloom filter check into this step as well? Right now we'll read the footers twice, but if we can build a cluster of candidate files for each key to evaluate against, we could read the range and the bloom filter at the same time and do both evaluations then, only for the files that the key may be a part of.
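To make the suggestion concrete, here is a minimal sketch of the combined evaluation: each file's min/max key range and its bloom filter are checked together in one pass, so the footer only needs to be read once per file. All names here are hypothetical, and a plain `Set` stands in for a real bloom filter (which may false-positive but never false-negative).

```java
import java.util.*;

// Hypothetical per-file metadata read once from the footer:
// the key range plus the bloom filter.
class FileRange {
    final String fileId;
    final String minKey, maxKey;
    final Set<String> bloom; // stand-in for a real bloom filter

    FileRange(String fileId, String minKey, String maxKey, Set<String> bloom) {
        this.fileId = fileId;
        this.minKey = minKey;
        this.maxKey = maxKey;
        this.bloom = bloom;
    }

    // true iff key falls inside [minKey, maxKey] AND the bloom filter says "maybe"
    boolean mayContain(String key) {
        return key.compareTo(minKey) >= 0
            && key.compareTo(maxKey) <= 0
            && bloom.contains(key);
    }
}

public class CombinedPruning {
    // Cluster of files a key may belong to, after both checks.
    static List<String> candidateFiles(List<FileRange> files, String key) {
        List<String> out = new ArrayList<>();
        for (FileRange f : files) {
            if (f.mayContain(key)) {
                out.add(f.fileId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileRange> files = List.of(
            new FileRange("f1", "a", "m", Set.of("c", "k")),
            new FileRange("f2", "n", "z", Set.of("p")));
        System.out.println(candidateFiles(files, "k")); // [f1]
        System.out.println(candidateFiles(files, "p")); // [f2]
    }
}
```

The point of the sketch is only that the range and bloom checks can share one metadata read; in the real code path the bloom filter would come from the parquet footer (or the metadata table) rather than being materialized as a set.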
