Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/12004
The latest patch embraces the fact that 2.6 is the base Hadoop version, so
the `hadoop-aws` JAR is always pulled in and its dependencies set up. One thing
to bear in mind here is that the [Phase I
fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and
s3a absolutely must not be used in production, the big killers being:
* [HADOOP-11570](https://issues.apache.org/jira/browse/HADOOP-11570):
closing the stream reads all the way to the EOF, which means every `seek()` can
end up reading as much as 2x the file size.
* [HADOOP-11584](https://issues.apache.org/jira/browse/HADOOP-11584): the block
size returned in `getFileStatus()` is 0. That is bad because both Pig and Spark
use that block size in partitioning, so they will split a file into single-byte
partitions: a 20MB file becomes 2*10^7 tasks, each of which will open the file
at byte 0, seek to its offset, then `close()`. The result is 2*10^7 tasks each
reading up to 2*10^7 bytes. This is generally considered "pathologically
suboptimal". I've had to modify my downstream tests to recognise when the block
size of a file is 0 and skip those tests (sketches below).
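
To make the HADOOP-11584 pathology concrete, here is a sketch of the
split-size arithmetic along the lines of Hadoop's `FileInputFormat` (names
simplified; this is not the exact Spark code path):

```scala
// splitSize = max(minSize, min(maxSize, blockSize)); minSize defaults to 1.
def computeSplitSize(blockSize: Long,
                     minSize: Long = 1L,
                     maxSize: Long = Long.MaxValue): Long =
  math.max(minSize, math.min(maxSize, blockSize))

computeSplitSize(blockSize = 64L * 1024 * 1024) // healthy FS: 64MB splits
computeSplitSize(blockSize = 0L)                // s3a today: 1-byte splits
// 20MB file / 1-byte splits = 2*10^7 partitions, hence 2*10^7 tasks.
```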
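
The downstream-test guard is roughly this (a hypothetical sketch using
ScalaTest's `assume`; `skipIfZeroBlockSize` is just my name for it):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.scalatest.Assertions.assume

// Cancel the test rather than let it fan out into millions of
// 1-byte partitions when the store reports a zero block size.
def skipIfZeroBlockSize(fs: FileSystem, path: Path): Unit = {
  val blockSize = fs.getFileStatus(path).getBlockSize
  assume(blockSize > 0,
    s"block size of $path is 0 (HADOOP-11584); skipping")
}
```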
s3n will work; in Hadoop 2.6 it moved into the `hadoop-aws` JAR, so this
reinstates the functionality that was in Spark builds against Hadoop 2.2-2.5.
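
For anyone trying that out, a minimal sketch of reading via s3n once
`hadoop-aws` is on the classpath (bucket/path are placeholders; credentials
would normally live in `core-site.xml` rather than in code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3n-smoke-test"))
// Standard s3n credential keys, read here from the environment.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
// s3n:// URLs resolve to NativeS3FileSystem, which hadoop-aws provides in 2.6+.
sc.textFile("s3n://some-bucket/some/path").count()
```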