kbuci opened a new pull request, #10540: URL: https://github.com/apache/hudi/pull/10540
### Change Logs

- Update AbstractHoodieLogRecordReader and its implementations to accept a HoodieTableMetaClient directly, and skip constructing one if it is provided:
  - org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader#AbstractHoodieLogRecordReader
  - org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner
  - org.apache.hudi.common.table.log.HoodieUnMergedLogRecordScanner
- Update Spark engine-related callers of AbstractHoodieLogRecordReader to pass in an already constructed HoodieTableMetaClient where feasible and applicable. Specifically, this targets cases where the Spark driver already has a HoodieTableMetaClient and launches a Spark stage in which each task creates or uses an AbstractHoodieLogRecordReader instance.

See HUDI-7316 for context.

### Impact

Currently, when using the Spark engine, there are cases where each Spark task needs to construct or use an AbstractHoodieLogRecordReader instance. While doing so, the task creates a HoodieTableMetaClient and reads the active timeline for it, which triggers a file listing call to the distributed file system service (such as the HDFS namenode).

If the caller that initiated the Spark stage from the driver can feasibly pass in a HoodieTableMetaClient with the active timeline already loaded, then allowing AbstractHoodieLogRecordReader implementations to accept that existing HoodieTableMetaClient avoids this unnecessary file listing call. For users that launch Spark jobs with hundreds or thousands of executors, this could avoid hundreds or thousands of file listing calls, which would likely all happen around the same time, since these tasks are launched in the same Spark stage.
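The effect described above can be illustrated with a small, self-contained model. This is only a sketch and does not use the Hudi API: `FakeMetaClient`, `FakeLogRecordReader`, `listingCallsFor`, and the base path are hypothetical stand-ins, and a counter replaces the real file listing call. It assumes that constructing a meta client costs exactly one listing call and that the reader constructs its own client only when none is passed in.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class MetaClientReuseSketch {
    // Counts simulated file listing calls (stand-in for calls to the DFS service).
    static final AtomicInteger LISTING_CALLS = new AtomicInteger();

    /** Hypothetical stand-in for a table meta client; NOT the Hudi class. */
    static final class FakeMetaClient {
        final String basePath;

        FakeMetaClient(String basePath) {
            this.basePath = basePath;
            // Assumption: loading the active timeline costs one listing call.
            LISTING_CALLS.incrementAndGet();
        }
    }

    /** Hypothetical stand-in for a log record reader; NOT the Hudi class. */
    static final class FakeLogRecordReader {
        final FakeMetaClient metaClient;

        FakeLogRecordReader(String basePath, Optional<FakeMetaClient> existing) {
            // Reuse the caller's meta client if provided; otherwise build a new one.
            this.metaClient = existing.orElseGet(() -> new FakeMetaClient(basePath));
        }
    }

    /** Simulates one driver plus `tasks` tasks and returns the total listing calls. */
    static int listingCallsFor(int tasks, boolean reuseDriverClient) {
        LISTING_CALLS.set(0);
        String basePath = "/tmp/hypothetical_table"; // illustrative path only
        FakeMetaClient driverClient = new FakeMetaClient(basePath); // one call on the driver
        Optional<FakeMetaClient> shared =
            reuseDriverClient ? Optional.of(driverClient) : Optional.empty();
        for (int i = 0; i < tasks; i++) {
            new FakeLogRecordReader(basePath, shared);
        }
        return LISTING_CALLS.get();
    }

    public static void main(String[] args) {
        System.out.println("without reuse: " + listingCallsFor(1000, false));
        System.out.println("with reuse:    " + listingCallsFor(1000, true));
    }
}
```

In this model, 1000 tasks without reuse incur one listing call each on top of the driver's own call, while passing the driver's client through leaves only the driver's call. The real saving depends on how many tasks construct a reader and on how the meta client is shipped to them.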
### Risk level (write none, low medium or high below)

low

### Documentation Update

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
