Re: [PR] [SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader [spark]

via GitHub Tue, 02 Jun 2026 01:47:40 -0700


ahmarsuhail commented on PR #50765:
URL: https://github.com/apache/spark/pull/50765#issuecomment-4600503814


   Agree with @steveloughran here: we should try to get this merged in. Both 
of these optimisations (passing in the FileStatus and re-using the same stream 
for footer + data reads) are very useful for cloud connectors. 
   
   Using separate streams for footer and data reads means the streams lose 
useful context between opens. For example, if the first stream prefetches 
and caches the tail of the file, which contains the bloom filters and 
page indexes, that cache is gone by the time the data stream needs it, 
because the first stream has already been closed. Essentially, any context 
and data must be cached outside the lifetime of the stream, which complicates 
things, as we don't know when to clear the cache. 
   
   Further, the two separate opens mean that we end up making 4 HEAD calls to 
S3, similar to the 4 RPC calls mentioned in this PR. Cutting these down would 
save both ~100s of ms per open and $$ when working with cloud connectors. 
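
   To make the two costs above concrete, here is a rough toy model in plain 
Java. It is only a sketch: `ToyStore`, `ToyStream`, `StreamReuseSketch` and 
the file name are hypothetical, it does not use any Hadoop or S3 API, and the 
assumption that each open costs one status lookup plus one HEAD inside the 
stream is chosen to match the count of 4 quoted above, not measured from any 
connector.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical object store that counts requests; not a real connector. */
class ToyStore {
    private final Map<String, Long> lengths = new HashMap<>();
    int headCalls = 0;   // metadata (HEAD-style) requests
    int tailFetches = 0; // ranged reads of the file tail

    void put(String key, long length) { lengths.put(key, length); }

    long head(String key) { headCalls++; return lengths.get(key); }

    void fetchTail(String key) { tailFetches++; }
}

/** Hypothetical stream whose tail cache lives only as long as the stream. */
class ToyStream implements AutoCloseable {
    private final ToyStore store;
    private final String key;
    final long length;
    private boolean tailCached = false;

    // Open without a known status: assumed to cost one HEAD for the length.
    ToyStream(ToyStore store, String key) { this(store, key, store.head(key)); }

    // Open with a caller-supplied length: no HEAD needed.
    ToyStream(ToyStore store, String key, long knownLength) {
        this.store = store;
        this.key = key;
        this.length = knownLength;
    }

    // Footer, bloom filters and page indexes all live in the tail.
    void readTail() {
        if (!tailCached) { store.fetchTail(key); tailCached = true; }
    }

    @Override public void close() { tailCached = false; } // cache dies here
}

class StreamReuseSketch {
    static final String KEY = "part-0.parquet"; // hypothetical file name

    static ToyStore newStore() {
        ToyStore s = new ToyStore();
        s.put(KEY, 1024L);
        return s;
    }

    // Two opens: each does a status lookup, then the stream probes again.
    static ToyStore separateOpens() {
        ToyStore store = newStore();
        store.head(KEY); // status lookup before the footer read
        try (ToyStream footer = new ToyStream(store, KEY)) { footer.readTail(); }
        store.head(KEY); // status lookup before the data read
        try (ToyStream data = new ToyStream(store, KEY)) { data.readTail(); }
        return store;
    }

    // One status lookup, passed into a single stream used for both reads.
    static ToyStore sharedOpen() {
        ToyStore store = newStore();
        long length = store.head(KEY);
        try (ToyStream stream = new ToyStream(store, KEY, length)) {
            stream.readTail(); // footer read fills the tail cache
            stream.readTail(); // bloom filters / page indexes hit the cache
        }
        return store;
    }

    public static void main(String[] args) {
        ToyStore a = separateOpens();
        ToyStore b = sharedOpen();
        System.out.println("separate: heads=" + a.headCalls + " tailFetches=" + a.tailFetches);
        System.out.println("shared:   heads=" + b.headCalls + " tailFetches=" + b.tailFetches);
    }
}
```

   Under these assumptions the separate-open path makes 4 HEAD calls and 
fetches the tail twice, while the shared path makes 1 HEAD call and fetches 
the tail once. The cache needs no external eviction policy, because it is 
dropped when the single stream closes.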


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

