> On 12 Jun 2015, at 17:12, Patrick Wendell <pwend...@gmail.com> wrote:
> 
>  For instance at Databricks we use
> the FileSystem library for talking to S3... every time we've tried to
> upgrade to Hadoop 2.X there have been significant regressions in
> performance and we've had to downgrade. That's purely anecdotal, but I
> think you have people out there using the Hadoop 1 bindings for whom
> upgrade would be a pain.

ah s3n. The unloved orphan FS, which has been fairly neglected as being 
non-strategic to anyone but Amazon, who have a private fork. 

s3n broke in Hadoop 2.4, where the upgraded Jets3t went in with a patch which 
swallowed exceptions (nobody should ever do that) and as a result would NPE on a 
seek(0) of a file of length 0 (HADOOP-10457). Fixed in Hadoop 2.5.

Hadoop 2.6 has left s3n on maintenance out of fear of breaking more things; 
future work is in s3a, which switched to the Amazon AWS SDK JAR and moved 
the implementation to the hadoop-aws JAR. s3a promises: speed, partitioned upload, 
better auth. 

But: it's not ready for serious use in Hadoop 2.6, so don't try. You need the 
Hadoop 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP2.3, and have 
been picked up in CDH5.3. (HADOOP-11571). For Spark, the fact that the block 
size is being returned as 0 in getFileStatus() could be the killer.
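
To show why a zero block size hurts, here is a minimal sketch of the split-size arithmetic. It assumes the job goes through the old-API FileInputFormat formula, max(minSize, min(goalSize, blockSize)); the class, the method names and the file sizes are made up for illustration, and whether this is exactly what bites a given Spark job is an assumption.

```java
// Sketch only: mirrors the split-size formula max(minSize, min(goalSize, blockSize)).
// SplitSizeSketch and its methods are hypothetical names, not Hadoop classes.
final class SplitSizeSketch {

    // Split size chosen for a file, given the filesystem's reported block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits needed to cover the file (ceiling division, no slop factor).
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileLength = 1L << 30;      // 1 GiB file (illustrative)
        long goalSize = fileLength / 8;  // 8 requested splits (illustrative)
        long minSize = 1L;               // assumed default minimum split size

        long sane = computeSplitSize(goalSize, minSize, 64L << 20); // 64 MiB blocks
        long broken = computeSplitSize(goalSize, minSize, 0L);      // block size reported as 0

        System.out.println("blockSize=64MiB -> splits: " + countSplits(fileLength, sane));
        System.out.println("blockSize=0     -> splits: " + countSplits(fileLength, broken));
    }
}
```

With a reported block size of 0 the split size collapses to the minimum, so under these assumptions the job is asked to plan one split per byte.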

Future work is going to improve performance and scale (HADOOP-11694).

Now, if Spark is finding problems with s3a performance, tests for this would be 
great, and complaints on JIRAs too. There's not enough functional testing of 
analytics workloads against the object stores, especially s3 and swift. If 
someone volunteers to add some optional test module for object store testing, 
I'll help review it and suggest some tests to generate stress.

That can be done without the leap to Hadoop 2, though the proposed HADOOP-9565 
work allowing object stores to declare that they are object stores and publish 
some of their consistency and atomicity semantics will be Hadoop 2.8+. If you 
want your output committers to recognise when the destination is an eventually 
consistent object store with O(n) directory rename and delete, that's where the 
code will be.
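
For anyone wondering why rename is O(n) on an object store, here is a rough in-memory sketch. It assumes a flat key/value store with no real directories, where a "directory rename" has to copy and delete every object under the prefix; the class and its methods are hypothetical, it is not the s3n or s3a code, and eventual consistency is not modelled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch only: a flat key -> bytes map standing in for an object store.
// ObjectStoreRenameSketch is a hypothetical name, not a Hadoop class.
final class ObjectStoreRenameSketch {
    private final TreeMap<String, byte[]> objects = new TreeMap<>();
    int copyCalls;    // one per object moved
    int deleteCalls;  // one per object moved

    void put(String key, byte[] data) { objects.put(key, data); }

    boolean exists(String key) { return objects.containsKey(key); }

    // "Rename" a directory: there is no directory entry to relink, so every
    // object under srcPrefix is copied to its new key and then deleted.
    int renameDirectory(String srcPrefix, String dstPrefix) {
        // Collect the matching keys first so the map is not modified mid-scan.
        List<String> keys = new ArrayList<>();
        for (String key : objects.tailMap(srcPrefix, true).keySet()) {
            if (!key.startsWith(srcPrefix)) {
                break; // sorted map: past the end of the prefix range
            }
            keys.add(key);
        }
        for (String key : keys) {
            String dstKey = dstPrefix + key.substring(srcPrefix.length());
            objects.put(dstKey, objects.get(key)); // copy
            copyCalls++;
            objects.remove(key);                   // delete
            deleteCalls++;
        }
        return keys.size();
    }
}
```

An output committer that moves a task's output by renaming a temporary directory pays that per-object cost on every commit, and the move is not atomic: a failure halfway leaves some objects under each prefix.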
