Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/12004
The latest patch embraces the fact that 2.6 is the base Hadoop version, so
the `hadoop-aws` JAR is always pulled in and its dependencies set up. One thing
to bear in mind here is that the [Phase I
fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and
s3a absolutely must not be used in production, the big killers being:
* [HADOOP-11570](https://issues.apache.org/jira/browse/HADOOP-11570):
closing the stream reads all the way to the EOF, which means every `seek()` can
end up reading as much as 2x the file size.
* [HADOOP-11584](https://issues.apache.org/jira/browse/HADOOP-11584): the block
size returned in `getFileStatus()` is 0. That is bad because both Pig and Spark
use that block size in partitioning, so they will split a file into single-byte
partitions: a 20MB file becomes 2*10^7 tasks, each of which will open the file
at byte 0, seek to its offset, then `close()`. The result is 2*10^7 tasks each
reading up to 2*10^7 bytes. This is generally considered "pathologically
suboptimal". I've had to modify my downstream tests to recognise when the block
size of a file is 0 and skip those tests (sketches below).
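
To make the HADOOP-11584 pathology concrete, here is a sketch of the
split-size arithmetic along the lines of Hadoop's `FileInputFormat` (names
simplified; this is not the exact Spark code path):

```scala
// splitSize = max(minSize, min(maxSize, blockSize)); minSize defaults to 1.
def computeSplitSize(blockSize: Long,
                     minSize: Long = 1L,
                     maxSize: Long = Long.MaxValue): Long =
  math.max(minSize, math.min(maxSize, blockSize))

computeSplitSize(blockSize = 64L * 1024 * 1024) // healthy FS: 64MB splits
computeSplitSize(blockSize = 0L)                // s3a today: 1-byte splits
// 20MB file / 1-byte splits = 2*10^7 partitions, hence 2*10^7 tasks.
```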
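
The downstream-test guard is roughly this (a hypothetical sketch using
ScalaTest's `assume`; `skipIfZeroBlockSize` is just my name for it):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.scalatest.Assertions.assume

// Cancel the test rather than let it fan out into millions of
// 1-byte partitions when the store reports a zero block size.
def skipIfZeroBlockSize(fs: FileSystem, path: Path): Unit = {
  val blockSize = fs.getFileStatus(path).getBlockSize
  assume(blockSize > 0,
    s"block size of $path is 0 (HADOOP-11584); skipping")
}
```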
s3n will work; in Hadoop 2.6 it moved into the `hadoop-aws` JAR, so this
reinstates the functionality that was in Spark builds against Hadoop 2.2-2.5.
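
For anyone trying that out, a minimal sketch of reading via s3n once
`hadoop-aws` is on the classpath (bucket/path are placeholders; credentials
would normally live in `core-site.xml` rather than in code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3n-smoke-test"))
// Standard s3n credential keys, read here from the environment.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
// s3n:// URLs resolve to NativeS3FileSystem, which hadoop-aws provides in 2.6+.
sc.textFile("s3n://some-bucket/some/path").count()
```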