kbuci opened a new pull request, #10540:
URL: https://github.com/apache/hudi/pull/10540

   
   ### Change Logs
   
   - Update AbstractHoodieLogRecordReader and its implementations to accept a 
HoodieTableMetaClient directly, and skip constructing one when it is provided
   -- org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader#AbstractHoodieLogRecordReader
   -- org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner
   -- org.apache.hudi.common.table.log.HoodieUnMergedLogRecordScanner
   
   - Update Spark engine-related callers of AbstractHoodieLogRecordReader to 
pass in an already constructed HoodieTableMetaClient where feasible and 
applicable. Specifically, this covers cases where the Spark driver already has a 
HoodieTableMetaClient and launches a Spark stage in which each task 
creates or uses an AbstractHoodieLogRecordReader instance.
   
   See HUDI-7316 for context
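
The constructor change described above can be illustrated with a minimal sketch. It does not use the real Hudi classes: `MetaClientSketch` and `LogRecordReaderSketch` are hypothetical stand-ins for HoodieTableMetaClient and an AbstractHoodieLogRecordReader implementation, and the build counter exists only to make the skipped construction visible.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for HoodieTableMetaClient; not the real Hudi class.
final class MetaClientSketch {
    // Counts constructions; in this sketch each build represents one timeline load.
    static final AtomicInteger BUILDS = new AtomicInteger();

    final String basePath;

    private MetaClientSketch(String basePath) {
        this.basePath = basePath;
    }

    static MetaClientSketch build(String basePath) {
        BUILDS.incrementAndGet();
        return new MetaClientSketch(basePath);
    }
}

// Hypothetical stand-in for an AbstractHoodieLogRecordReader implementation.
final class LogRecordReaderSketch {
    final MetaClientSketch metaClient;

    LogRecordReaderSketch(String basePath, Optional<MetaClientSketch> provided) {
        // Reuse the caller's client when present; otherwise fall back to building one.
        this.metaClient = provided.orElseGet(() -> MetaClientSketch.build(basePath));
    }

    public static void main(String[] args) {
        MetaClientSketch shared = MetaClientSketch.build("/tmp/table");
        LogRecordReaderSketch reader =
            new LogRecordReaderSketch("/tmp/table", Optional.of(shared));
        // The reader holds the caller's instance, so no second build happened.
        System.out.println(reader.metaClient == shared);
    }
}
```

Passing `Optional.empty()` keeps the previous behaviour, so callers that have no client at hand are unaffected in this sketch.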
   
   ### Impact
   Currently, when using the Spark engine, there are cases where each Spark task 
needs to construct or use an AbstractHoodieLogRecordReader instance. While 
doing so, the task creates a HoodieTableMetaClient and reads the active timeline 
for it, which causes a file listing call to the distributed file system service 
(such as the HDFS NameNode). If the caller that initiated the Spark stage from 
the driver can feasibly pass in a HoodieTableMetaClient with the active timeline 
already loaded, then allowing it to pass that existing client to 
AbstractHoodieLogRecordReader implementations avoids this unnecessary file 
listing call. For users who launch Spark jobs with hundreds or thousands of 
executors, this could avoid hundreds or thousands of file listing calls, which 
would likely happen at around the same time because all of these tasks are 
initiated in the same Spark stage.
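
The arithmetic behind this can be sketched with a small sequential simulation. It makes no Spark or Hudi calls: `FakeFileSystem`, `listTimeline`, and `listingCalls` are hypothetical names, and the loop stands in for the tasks of a single stage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class ListingCountSketch {
    // Hypothetical stand-in for the distributed file system; counts listing calls.
    static final class FakeFileSystem {
        final AtomicInteger listCalls = new AtomicInteger();

        List<String> listTimeline() {
            listCalls.incrementAndGet();
            return Arrays.asList("001.commit", "002.deltacommit");
        }
    }

    // Returns the total number of listing calls for one simulated stage.
    static int listingCalls(int tasks, boolean reuseDriverTimeline) {
        FakeFileSystem fs = new FakeFileSystem();
        // The driver loads the timeline once in both scenarios.
        List<String> driverTimeline = fs.listTimeline();
        for (int task = 0; task < tasks; task++) {
            // Without reuse, every task triggers its own listing call.
            List<String> timeline = reuseDriverTimeline ? driverTimeline : fs.listTimeline();
            if (timeline.isEmpty()) {
                throw new IllegalStateException("timeline should not be empty");
            }
        }
        return fs.listCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(listingCalls(1000, false)); // prints 1001
        System.out.println(listingCalls(1000, true));  // prints 1
    }
}
```

With 1000 simulated tasks the count drops from 1001 listing calls to 1, which is the saving described above, under the assumption that each task would otherwise list the timeline exactly once.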
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
