Re: [PR] [SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader [spark]

via GitHub Tue, 02 Jun 2026 02:05:06 -0700


pan3793 commented on PR #50765:
URL: https://github.com/apache/spark/pull/50765#issuecomment-4600739008


   > these optimisations (passing in the filestatus and re-using the same 
stream for footer + data reads) are very useful for cloud connectors.
   
   @ahmarsuhail the latter was split out into 
https://github.com/apache/spark/pull/52384 and landed in Spark 4.1.0. Since I 
don't have experience with cloud storage services, I may not be able to 
evaluate the benefit.
   
   For the remaining part mentioned in 
https://github.com/apache/spark/pull/50765#discussion_r2357607758
   
   > constructing FileStatus from the executor side directly
   
   this requires broad testing across different storage backends. I'm not sure 
whether a basic `FileStatus` with only the file path and offset/length is 
sufficient for all of them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


