ahmarsuhail commented on PR #50765: URL: https://github.com/apache/spark/pull/50765#issuecomment-4600503814
Agree with @steveloughran here, we should try to get this merged. Both of these optimisations (passing in the file status and reusing the same stream for footer and data reads) are very useful for cloud connectors.

Using separate streams for footer and data reads means streams lose useful context between the opens. For example, if on the first stream open we prefetch and cache the tail of the file, which contains the bloom filters and page indexes, that data is gone by the time the data stream actually needs it, because the first stream has been closed. Essentially, any context and data must be cached outside the life of the stream, which complicates things, as we don't know when to clear the cache.

Further, the two separate opens mean that we end up making 4 HEAD calls to S3, similar to the 4 RPC calls mentioned in this PR. Cutting these down would save both ~100s of ms per open and money when working with cloud connectors.
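To make the cost of the separate opens concrete, here is a toy model in Python. It is only a sketch under stated assumptions: `ToyStore` and `ToyStream` are hypothetical names, not S3A, Spark, or Parquet classes, and the model assumes one HEAD per open when the length is unknown plus one tail prefetch per open. Real connectors may issue more requests than this.

```python
class ToyStore:
    """Hypothetical in-memory object store that counts HEAD and GET requests."""

    def __init__(self, objects):
        self.objects = objects
        self.head_calls = 0
        self.get_calls = 0

    def head(self, key):
        # Metadata probe: returns the object length.
        self.head_calls += 1
        return len(self.objects[key])

    def get_range(self, key, start, end):
        # Ranged read of bytes [start, end).
        self.get_calls += 1
        return self.objects[key][start:end]


class ToyStream:
    """Hypothetical stream that prefetches the file tail when it is opened."""

    TAIL = 8  # number of tail bytes to prefetch (arbitrary for this sketch)

    def __init__(self, store, key, length=None):
        self.store, self.key = store, key
        # Assumption: a caller that already knows the length (for example
        # from a directory listing) can pass it in and skip the HEAD.
        self.length = store.head(key) if length is None else length
        self.tail_start = max(0, self.length - self.TAIL)
        self.tail = store.get_range(key, self.tail_start, self.length)

    def read(self, start, end):
        # Serve from the cached tail when possible, else go to the store.
        if start >= self.tail_start:
            off = start - self.tail_start
            return self.tail[off:off + (end - start)]
        return self.store.get_range(self.key, start, end)


def separate_streams(store, key):
    # Footer stream and data stream are opened independently, so the
    # tail cached by the first stream is lost when it goes away.
    footer_stream = ToyStream(store, key)
    footer = footer_stream.read(footer_stream.length - 4, footer_stream.length)
    data_stream = ToyStream(store, key)
    index = data_stream.read(data_stream.length - 8, data_stream.length - 4)
    return footer, index


def shared_stream(store, key, known_length):
    # One stream, opened with an already-known length, serves both reads.
    stream = ToyStream(store, key, length=known_length)
    footer = stream.read(stream.length - 4, stream.length)
    index = stream.read(stream.length - 8, stream.length - 4)
    return footer, index


DATA = {"f.parquet": b"rowgroup-data" + b"IDX1" + b"FOOT"}

a = ToyStore(DATA)
separate_streams(a, "f.parquet")
b = ToyStore(DATA)
shared_stream(b, "f.parquet", known_length=len(DATA["f.parquet"]))
print("separate:", a.head_calls, "HEAD,", a.get_calls, "GET")
print("shared:  ", b.head_calls, "HEAD,", b.get_calls, "GET")
```

In this model the shared stream avoids every HEAD and fetches the tail once, while the separate streams repeat both. The absolute counts are an artefact of the toy; the point is only that the per-open overhead is paid twice and the cached tail does not survive the first close.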
