GitHub user steveloughran opened a pull request:
https://github.com/apache/spark/pull/12004
[SPARK-7481][build][WIP] Add Hadoop 2.6+ profile to pull in object store FS
accessors
## What changes were proposed in this pull request?
[SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors
in hadoop-openstack, hadoop-aws and hadoop-azure.
As a result, the Hadoop s3n:// support classes come back into
spark-assembly; s3a and openstack support are added on Hadoop 2.6, and azure
support on Hadoop 2.7. It does not add the external dependencies needed for s3a or azure:
- spark-assembly has an explicit dependency on jets3t; this is used by s3n
- s3a needs a (large) amazon-aws JAR in Hadoop 2.6; Hadoop 2.7 has switched
to a leaner amazon-aws-s3 JAR.
- azure needs a microsoft azure storage JAR
- openstack reuses JARs already in the assembly and adds one, commons-io.
commons-io is not currently excluded from spark-assembly, though it would be easy to add that exclusion.
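As a rough sketch, the per-version wiring could look like the following Maven profile fragment. The profile id, artifact ids, and the wildcard SDK exclusion here are illustrative, not the patch's exact pom:

```xml
<!-- Hypothetical sketch of a Hadoop-version profile pulling in the
     object store connectors while keeping the large AWS SDK out of
     the assembly; ids and versions are illustrative. -->
<profile>
  <id>hadoop-2.6</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <!-- exclude the transitive amazon SDK from the assembly -->
        <exclusion>
          <groupId>com.amazonaws</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-openstack</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</profile>
```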
The patch defines a new module, "cloud" with transitive dependencies on the
amazon (hadoop 2.6+) and azure (hadoop 2.7+) JARs. The spark assembly JAR pulls
in spark-cloud and its hadoop dependencies (scoped at hadoop-provided) but
excludes those external dependencies.
Having an explicit module allows for follow-up work, specifically some
tests. It also enables downstream applications to declare a dependency on
`spark-cloud` and get the object store accessors (and anything else people
choose to add in future).
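For a downstream application, declaring that dependency might look like the following pom fragment; the artifact id and version property are illustrative, not confirmed names:

```xml
<!-- Hypothetical downstream usage: depend on spark-cloud to pull in
     the object store connectors transitively; artifact id and
     version property are illustrative. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-cloud_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
```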
## How was this patch tested?
The dependency logic was verified via maven dependency checking; the
inclusion of the hadoop code and the exclusion of com.microsoft and com.amazon
files were checked by examining the contents of the assembly JAR. That check
could be automated.
To test that the spark integration with s3a, wasb, etc. works, I'd
propose a follow-up piece of work: add some tests to spark-cloud, along with
the pom changes needed to pass down the environment options for running the
tests, skipping them if the credentials are not provided.
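One way the pom could pass such options down to the test JVM is via surefire system properties, with the tests skipping themselves when a property is unset. This is a sketch only, with illustrative property names:

```xml
<!-- Hypothetical surefire fragment; the property name is illustrative.
     Tests would read test.fs.s3a.uri and skip themselves if empty. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <systemPropertyVariables>
      <test.fs.s3a.uri>${test.fs.s3a.uri}</test.fs.s3a.uri>
    </systemPropertyVariables>
  </configuration>
</plugin>
```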
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark features/SPARK-7481-cloud
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12004.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12004
----
commit 5e9cfbe30a5aff78e5b807a2d2cf38aa1a2b814d
Author: Steve Loughran <[email protected]>
Date: 2016-03-28T17:40:59Z
[SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors
in hadoop-openstack, hadoop-aws and hadoop-azure.
This defines a new module, "cloud" with transitive dependencies on the
amazon (hadoop 2.6+) and azure (hadoop 2.7+) JARs. The spark assembly JAR pulls
in spark-cloud and its hadoop dependencies (scoped at hadoop-provided) but
excludes those external dependencies. The hadoop classes come in (visually
verified in JAR); the com.amazon and com.microsoft artifacts are omitted.
Having an explicit module allows for followup work, specifically some tests
----