[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

steveloughran Tue, 08 Dec 2015 10:37:02 -0800

Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162974033
  
    Has anyone compared the performance of this against S3a in Hadoop 2.7+? 
I do agree this will dramatically improve s3n: and s3: perf, but all 
ongoing Hadoop work is on the s3a FS, with s3n left alone on the grounds that 
every jets3t upgrade or other change breaks things. S3a does use `ListRequest` 
and I'd expect it to not only list faster, but have faster reads too.
    
    That doesn't mean this patch won't be useful: if anyone still uses s3: 
it'll be essential (there's no maintenance going on there), and the code here 
will also benefit Hadoop <= 2.6. It's just that for 2.7+ I would say "use s3a 
and be done with it". That said, there's still a lot of s3a work left to be 
looked at, especially around lazy seeks.
    
    What could be very useful for the Hadoop team here is some tests for Spark 
using S3, so as to catch regressions in functionality, performance, and scale:
    
    1. Measure ls() performance. Maybe we can find/get someone to create 
an S3 store pre-populated with many files.
    2. Look at the costs of read + seek + close on big files. 
[HADOOP-12376](https://issues.apache.org/jira/browse/HADOOP-12376) turned out 
to be a surprise there: if you close() a multi-GB file 3 bytes in, that close() 
still reads through to the end of the file before it completes. Again, having 
some public reference files would aid testing here.
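
    A minimal sketch of how the read + seek + close costs in item 2 could be 
timed phase by phase. It is written in Python against any file-like object; 
the function name `time_read_seek_close` and the in-memory stand-in are 
assumptions for illustration, not part of Spark or Hadoop. A real test would 
pass in a callable that opens a stream on a large S3 object instead.

```python
import io
import time
from typing import BinaryIO, Callable, Dict


def time_read_seek_close(open_stream: Callable[[], BinaryIO],
                         offset: int,
                         nbytes: int) -> Dict[str, float]:
    """Time open, seek, read and close separately (seconds).

    `open_stream` is a hypothetical hook: any zero-argument callable that
    returns a binary stream supporting seek(), read() and close().
    """
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    stream = open_stream()
    timings["open"] = time.perf_counter() - start

    try:
        start = time.perf_counter()
        stream.seek(offset)
        timings["seek"] = time.perf_counter() - start

        start = time.perf_counter()
        data = stream.read(nbytes)
        timings["read"] = time.perf_counter() - start
        timings["bytes_read"] = float(len(data))
    finally:
        # close() is timed on its own because it can dominate the total
        # cost when the stream drains the rest of the object on close.
        start = time.perf_counter()
        stream.close()
        timings["close"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    # Stand-in for a large remote object: 1 MiB of zeros held in memory.
    payload = bytes(1024 * 1024)
    result = time_read_seek_close(lambda: io.BytesIO(payload),
                                  offset=3, nbytes=16)
    for phase, value in result.items():
        print(phase, value)
```

    Reporting close() as its own phase is the point of the design: a 
regression of the kind described above would show up as a close time that 
grows with object size rather than with the number of bytes read.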




