GitHub user steveloughran opened a pull request:
https://github.com/apache/spark/pull/12004
[SPARK-7481][build][WIP] Add Hadoop 2.6+ profile to pull in object store FS
accessors
## What changes were proposed in this pull request?
[SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors
in hadoop-openstack, hadoop-aws and hadoop-azure.
As a result, the Hadoop s3n:// support classes come back into
spark-assembly; s3a and openstack support are added on Hadoop 2.6, and azure
support on Hadoop 2.7. It does not add the external dependencies needed for s3a or azure:
- spark-assembly has an explicit dependency on jets3t; this is used by s3n
- s3a needs a (large) amazon-aws JAR in Hadoop 2.6; Hadoop 2.7 has switched
to a leaner amazon-aws-s3 JAR.
- azure needs a microsoft azure storage JAR
- openstack reuses JARs already in the assembly and adds one, commons-io.
commons-io is not currently excluded from spark-assembly, though it would be easy to add that exclusion.
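As a rough sketch, the per-version wiring could look like the following Maven profile fragment. The profile id, artifact ids, and the wildcard SDK exclusion here are illustrative, not the patch's exact pom:

```xml
<!-- Hypothetical sketch of a Hadoop-version profile pulling in the
     object store connectors while keeping the large AWS SDK out of
     the assembly; ids and versions are illustrative. -->
<profile>
  <id>hadoop-2.6</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <!-- exclude the transitive amazon SDK from the assembly -->
        <exclusion>
          <groupId>com.amazonaws</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-openstack</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</profile>
```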
The patch defines a new module, "cloud" with transitive dependencies on the
amazon (hadoop 2.6+) and azure (hadoop 2.7+) JARs. The spark assembly JAR pulls
in spark-cloud and its hadoop dependencies (scoped at hadoop-provided) but
excludes those external dependencies.
Having an explicit module allows for follow-up work, specifically some
tests. It also enables downstream applications to declare a dependency on
`spark-cloud` and get the object store accessors (and anything else people
choose to add in future).
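For a downstream application, declaring that dependency might look like the following pom fragment; the artifact id and version property are illustrative, not confirmed names:

```xml
<!-- Hypothetical downstream usage: depend on spark-cloud to pull in
     the object store connectors transitively; artifact id and
     version property are illustrative. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-cloud_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
```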
## How was this patch tested?
The dependency logic was verified via maven dependency checking; the
inclusion of the hadoop code and the exclusion of com.microsoft and com.amazon
files were checked by examining the contents of the assembly JAR. That check
could be automated.
To test that the spark integration with s3a, wasb, etc. works, I'd
propose a follow-up piece of work: add some tests to spark-cloud, along with
the pom changes needed to pass down the environment options for running the
tests, skipping them if the credentials are not provided.
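One way the pom could pass such options down to the test JVM is via surefire system properties, with the tests skipping themselves when a property is unset. This is a sketch only, with illustrative property names:

```xml
<!-- Hypothetical surefire fragment; the property name is illustrative.
     Tests would read test.fs.s3a.uri and skip themselves if empty. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <systemPropertyVariables>
      <test.fs.s3a.uri>${test.fs.s3a.uri}</test.fs.s3a.uri>
    </systemPropertyVariables>
  </configuration>
</plugin>
```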
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark features/SPARK-7481-cloud
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12004.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12004
----
commit 5e9cfbe30a5aff78e5b807a2d2cf38aa1a2b814d
Author: Steve Loughran <[email protected]>
Date: 2016-03-28T17:40:59Z
[SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors
in hadoop-openstack, hadoop-aws and hadoop-azure.
This defines a new module, "cloud" with transitive dependencies on the
amazon (hadoop 2.6+) and azure (hadoop 2.7+) JARs. The spark assembly JAR pulls
in spark-cloud and its hadoop dependencies (scoped at hadoop-provided) but
excludes those external dependencies. The hadoop classes come in (visually
verified in JAR); the com.amazon and com.microsoft artifacts are omitted.
Having an explicit module allows for followup work, specifically some tests
----