sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713055884
Thanks @steveloughran for putting all the context on S3A/ABFS, and sorry
for the late comment.

> HEAD: don't call exists/getFileStatus/etc if you know a file is there.
> It's also better to do a `try { open() } catch (f: FileNotFoundException)` than
> a probe + open, as you will save a HEAD.

Good point. I guess this applies to other FS impls in general as well. I can
spend some time checking the Spark codebase for this pattern and finding
potential improvements; see the sketch below for what I mean.
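
For reference, a minimal sketch of the two patterns (against the Hadoop
`FileSystem` API; the helper names are just illustrative):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

object OpenPatterns {
  // Anti-pattern on object stores: exists() issues its own HEAD request,
  // and then open() issues another one for the same file.
  def openWithProbe(fs: FileSystem, p: Path): Option[FSDataInputStream] =
    if (fs.exists(p)) Some(fs.open(p)) else None

  // Preferred: just open() and handle the failure, saving one HEAD per file.
  def openDirect(fs: FileSystem, p: Path): Option[FSDataInputStream] =
    try Some(fs.open(p))
    catch { case _: FileNotFoundException => None }
}
```
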
> The new openFile() API will let the caller specify seek policy
> (sequential, random, adaptive, ...) and, if you pass in the known file length,
> skip the HEAD. It's also an async operation on S3A, even when a HEAD is needed.

Cool. This is in Hadoop 3.3.0+ though. We can perhaps explore it once Spark makes
the switch.
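
Roughly what I have in mind, as a sketch only: this assumes the Hadoop 3.3.0
`openFile()` builder and the S3A option key `fs.s3a.experimental.input.fadvise`
for the seek policy, so treat the details as unverified:

```scala
import org.apache.hadoop.fs.{FSDataInputStream, FileStatus, FileSystem}

object OpenFileSketch {
  // `status` comes from an earlier directory listing, so withFileStatus()
  // lets S3A skip the HEAD it would otherwise issue for the file length.
  def openForRandomIO(fs: FileSystem, status: FileStatus): FSDataInputStream =
    fs.openFile(status.getPath)
      .opt("fs.s3a.experimental.input.fadvise", "random") // seek policy
      .withFileStatus(status)                             // skip the HEAD
      .build()                                            // async: CompletableFuture
      .get()                                              // block for the stream
}
```
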
> LIST ...

Yes, I wish Spark could benefit from the paged listing offered by `FileSystem`
impls (I know Presto has
[optimizations](https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java)
around this, and it seems to work really well for them), but it would need some
significant changes on the Spark side.
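
For context, the incremental listing API already exists on the Hadoop side; a
minimal sketch of its shape (the helper name is just illustrative):

```scala
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

object ListingSketch {
  // RemoteIterator lets the FS fetch listing pages lazily, so callers can
  // start processing the first entries before the full listing arrives.
  def forEachFile(fs: FileSystem, dir: Path)(f: LocatedFileStatus => Unit): Unit = {
    val it = fs.listFiles(dir, true) // recursive
    while (it.hasNext) f(it.next())
  }
}
```
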
> What you can do right now is add an option to log the toString() values of
> input streams/output streams/remoteiterators at debug level to some
> performance-only log

Sounds good. Let me try adding these; a rough shape is sketched below.
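
Something along these lines, assuming an slf4j logger (the logger name
`org.apache.spark.io.perf` is just a placeholder I made up):

```scala
import org.apache.hadoop.fs.FSDataInputStream
import org.slf4j.LoggerFactory

object PerfLogging {
  // Placeholder name for a performance-only logger; S3A/ABFS streams embed
  // IO statistics (seeks, bytes read, ...) in their toString() output.
  private val perfLog = LoggerFactory.getLogger("org.apache.spark.io.perf")

  def closeWithStats(in: FSDataInputStream): Unit = {
    if (perfLog.isDebugEnabled) perfLog.debug("stream stats: {}", in)
    in.close()
  }
}
```
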
> I'd love to get some comparisons here.

I'll see what I can do ... though without the incremental listing support I
don't think we'll see much difference here. Plus, I don't have a testing
environment for S3A at hand :(