Github user steveloughran commented on the pull request:
https://github.com/apache/spark/pull/8512#issuecomment-162974033
Has anyone looked at the performance of this versus S3a in Hadoop 2.7+?
Because while I do agree this will dramatically improve s3n: and s3: perf, all
ongoing Hadoop work is on the s3a FS, with s3n left alone on the grounds that
every upgrade of jets3t or change breaks things. S3a does use {{ListRequest}}
and I'd expect it to not only list faster, but have faster reads too.
That doesn't mean this patch won't be useful: if anyone still uses s3:
it'll be essential (there's no maintenance going on there), and the code here
will also benefit Hadoop <= 2.6. It's just that for 2.7+ I would say "use s3a
and be done with it". That said, there's lots of work on s3a which remains to
be looked at, especially in lazy seeks.
What could be very useful for the Hadoop team here is some tests for Spark
using S3, so as to catch regressions in functionality, performance, and scale:
1. Measure ls() performance. Maybe we can find/get someone to create
an s3 store pre-populated with many files.
2. Look at the costs of read + seek + close on big files.
[HADOOP-12376](https://issues.apache.org/jira/browse/HADOOP-12376) turned out
to be a surprise there: if you close() a multi-GB file 3 bytes in, that close()
still completes the download of the remaining data. Again, having some public
reference files would aid testing here.
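
As a rough illustration of the two measurements above, here is a minimal timing sketch. It runs against a local temporary directory purely as a stand-in for an S3 store; the helper names (`time_call`, `list_files`, `read_seek_close`, `make_fixture`) and the file counts and sizes are all hypothetical, and a real test would swap in the s3n/s3a filesystem client and a pre-populated bucket.

```python
import os
import tempfile
import time


def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def list_files(directory):
    """Stand-in for ls(): return the sorted entry names in a directory."""
    return sorted(os.listdir(directory))


def read_seek_close(path, read_bytes, seek_to):
    """Read a few bytes, seek, read again, then close.

    Returns (total_bytes_read, close_seconds). close() is timed on its own
    because it is the call whose cost was surprising.
    """
    f = open(path, "rb")
    try:
        head = f.read(read_bytes)
        f.seek(seek_to)
        tail = f.read(read_bytes)
    finally:
        _, close_seconds = time_call(f.close)
    return len(head) + len(tail), close_seconds


def make_fixture(directory, n_small, big_size):
    """Create n_small empty files plus one larger file; return its path."""
    for i in range(n_small):
        open(os.path.join(directory, f"part-{i:05d}"), "wb").close()
    big = os.path.join(directory, "big.bin")
    with open(big, "wb") as f:
        f.write(b"\0" * big_size)
    return big


if __name__ == "__main__":
    # Hypothetical sizes; a real S3 test would want far more files and data.
    with tempfile.TemporaryDirectory() as d:
        big = make_fixture(d, n_small=1000, big_size=8 * 1024 * 1024)
        names, ls_seconds = time_call(list_files, d)
        (nbytes, close_seconds), total_seconds = time_call(
            read_seek_close, big, 3, 1024 * 1024
        )
        print(f"ls: {len(names)} entries in {ls_seconds:.6f}s")
        print(f"read+seek+close: {nbytes} bytes in {total_seconds:.6f}s "
              f"(close alone: {close_seconds:.6f}s)")
```

On a local disk close() will be near-instant; the point of timing it separately is that against an object store it may not be.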