[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

steveloughran Tue, 02 May 2017 11:33:39 -0700

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/17834


    [SPARK-7481] [build] Add spark-hadoop-cloud module to pull in object store 
access.

    ## What changes were proposed in this pull request?
    
    Add a new `spark-hadoop-cloud` module and Maven profile to pull in object 
store support from the `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 
2.7+) JARs, along with their dependencies, reconciling dependency versions 
(Jackson in particular) so that everything works.
    
    It restores `s3n://` access to S3 and adds its `s3a://` replacement, along 
with OpenStack `swift://` and Azure `wasb://`.
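    
    As a rough illustration of which connector module serves which URL scheme, 
here is a minimal Python sketch. The module names come from the description 
above; the scheme-to-module pairing is an assumption based on how the Hadoop 
connectors are usually packaged, and the helper name `connector_module` and the 
example bucket name are hypothetical, not part of this patch.

```python
from urllib.parse import urlparse

# Scheme -> Hadoop module assumed to ship the matching connector.
# Module names are taken from the PR text; the pairing is an assumption.
CONNECTOR_MODULES = {
    "s3n": "hadoop-aws",
    "s3a": "hadoop-aws",
    "swift": "hadoop-openstack",
    "wasb": "hadoop-azure",
}


def connector_module(url: str) -> str:
    """Return the module expected to provide the connector for `url`."""
    scheme = urlparse(url).scheme.lower()
    try:
        return CONNECTOR_MODULES[scheme]
    except KeyError:
        # Schemes outside the four named in the PR are not covered here.
        raise ValueError(f"no object store connector listed for {scheme!r}") from None


if __name__ == "__main__":
    # "example-bucket" is a made-up name used only for illustration.
    print(connector_module("s3a://example-bucket/data/part-0000.csv"))
```

    A lookup like this only tells you which JAR to expect on the classpath; it 
says nothing about whether the store behaves like a filesystem.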
    
    There's a documentation page, `cloud_integration.md`, which covers the 
basic details of using Spark with object stores, referring the reader to the 
supplier's own documentation, with specific warnings on security and the 
possible mismatch between a store's behavior and that of a filesystem.
    In particular, users are advised to be very cautious when trying to use an 
object store as the destination of data, and to consult the documentation of 
the storage supplier and the connector.
    
    (this is the successor to #12004; I can't re-open it)
    
    ## How was this patch tested?
    
    Tests are in 
[https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
    
    Those verify that the dependencies are sufficient to allow downstream 
applications to work with the s3a, Azure wasb and swift storage connectors, and 
to perform basic IO and dataframe operations on them. All seems well.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-7481-current

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17834
    
----
commit 1da9a3d181e5226a0ae9379c0c8905b319a4afe9
Author: Steve Loughran <[email protected]>
Date:   2016-11-18T15:50:15Z

    [SPARK-7481] stripped down packaging only module

commit 028d9ed428638520239da7d2b619d20817df56fd
Author: Steve Loughran <[email protected]>
Date:   2016-11-18T17:02:53Z

    [SPARK-7481] basic instantiation tests verify that dependency hadoop-azure, 
hadoop-aws, hadoop-openstack and implicitly their transitive dependencies are 
resolved. They don't verify all dependency setup, specifically that Jackson 
versions are consistent; that needs integration testing.

commit ace46e98e913ee68c0aca88d17eeb0f055da074b
Author: Steve Loughran <[email protected]>
Date:   2016-11-18T19:04:53Z

    [SPARK-7481] tests restricted to instantiation; logging modified 
appropriately

commit 3f6dfdad893d083e4653c547fcd6406a91dd9544
Author: Steve Loughran <[email protected]>
Date:   2016-11-21T12:07:25Z

    [SPARK-7481] declare httpcomponents:httpclient explicitly, as downstream 
tests which pulled in spark-cloud but not spark-hive were ending up with 
inconsistent versions. Add a test for the missing class being there too.

commit 5f8f996cea76a16391073b46023a981cba3b3cce
Author: Steve Loughran <[email protected]>
Date:   2016-11-21T17:56:05Z

    [SPARK-7481] update docs by culling section on cloud integration tests; 
link to remaining docs from top level.

commit e92a49322dfdb777e996e9b07b298bb8ae8967d6
Author: Steve Loughran <[email protected]>
Date:   2016-11-28T15:44:10Z

    [SPARK-7481]  updated documentation as per review

commit 97e80e1963b8f64905165c08974197cf4cd68356
Author: Steve Loughran <[email protected]>
Date:   2016-11-28T15:44:30Z

    [SPARK-7481]  SBT will build this now, optionally

commit ef3cebfd1baf928c3f30380f662eaee13ee6ca08
Author: Steve Loughran <[email protected]>
Date:   2016-11-28T15:45:44Z

    [SPARK-7481] cloud POM includes jackson-dataformat-cbor, so that the CP is 
set up consistently for the later versions of the AWS SDK

commit 66650c7c7d4d9e2cb640175428bf16a343d6319b
Author: Steve Loughran <[email protected]>
Date:   2016-12-01T13:30:48Z

    [SPARK-7481]  rebase with master; Pom had got out of sync

commit 31cc37e90f2dcb0ebbe696bc08d951e0526293f9
Author: Steve Loughran <[email protected]>
Date:   2016-12-02T17:39:52Z

    [SPARK-7481] rename spark-cloud module to spark-hadoop-cloud, in POMs and 
docs

commit 2fc6f23b5397f344583c0e192f88fb40bb88f6ad
Author: Steve Loughran <[email protected]>
Date:   2016-12-14T15:47:10Z

    [SPARK-7841] bump up cloud pom to 2.2.0-SNAPSHOT; other minor pom cleanup

commit 65f6814ccba464dbba1c8a5390638291c7c3cf1a
Author: Steve Loughran <[email protected]>
Date:   2017-01-10T14:07:18Z

    [SPARK-7481] builds against Hadoop shaded 3.x clients failing as direct 
references to AWS classes failing. Cut them and rely on transitive load through 
FS class instantiation to force the load. All that happens is that failures to 
link will be slightly less easy to debug.

commit 73820a341cbbdecdd386a1448300439577273671
Author: Steve Loughran <[email protected]>
Date:   2017-01-20T13:52:45Z

    [SPARK-7481] update 2.7 dependencies to include azure, aws and openstack 
JARs, transitive dependencies on aws and azure SDKs

commit 824d801d43000161533dd50c9e2c7d2f1a1f7a0b
Author: Steve Loughran <[email protected]>
Date:   2017-01-30T14:27:39Z

    [SPARK-7481] add joda time as the dependency. Tested against hadoop 
branch-2, s3 ireland

commit 12a1b8488968917e4d99a39c7dd3ac2d39f87727
Author: Steve Loughran <[email protected]>
Date:   2017-02-24T14:30:29Z

    SPARK-7481 purge all tests from the cloud module

commit a7a2deca3cf00488682e355b41c716ecce57a62f
Author: Steve Loughran <[email protected]>
Date:   2017-03-20T14:10:12Z

    SPARK-7481 add cloud module to sbt sequence
    
    Change-Id: I3dea2544f089615493163f0fae482992873f9c35

commit 02f6e19bef8d7e1e0622d04bf47bb2c785996877
Author: Steve Loughran <[email protected]>
Date:   2017-03-20T14:14:37Z

    SPARK-7481 break line of mvn XML declaration
    
    Change-Id: Ibd6d40df2bc8a2edf19a058c458bea233ba414fd

commit ce042d2405706bc7cd6b0d2a410c36346be0c86e
Author: Steve Loughran <[email protected]>
Date:   2017-03-20T19:19:49Z

    SPARK-7481 cloud pom is still JAR (not pom). works against Hadoop 2.6 as 
well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ 
hadoop.scope
    
    Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375

commit a98575370d9af1cda2c8b05672beea101ec6e83e
Author: Steve Loughran <[email protected]>
Date:   2017-04-27T15:07:10Z

    SPARK-7481 move to Spark 2.3.0-SNAPSHOT
    
    Change-Id: I91f764aeed7d832df1538453d869a7fd83964d65

commit 0e0527d62295b1d18a53ab12ac12fddaddf7be94
Author: Steve Loughran <[email protected]>
Date:   2017-04-27T20:18:06Z

    tweaked pom; updated docs
    
    Change-Id: I12ea6ed72ffa9edee964c90c862ff4c45bc4f47f

commit b78158f7aaeaebda206c30ea3e620b3775b3481b
Author: Steve Loughran <[email protected]>
Date:   2017-04-28T14:50:58Z

    SPARK-7481 strip down the docs to a bare minimum: FS differences, security, 
spark-specific options + links elsewhere
    
    Change-Id: I7e9efe20d116802a403af875b241b91178078d78

commit de3e95bfaa012fe8003d030fe84b00259d7610aa
Author: Steve Loughran <[email protected]>
Date:   2017-04-28T15:52:06Z

    SPARK-7481 doc review
    
    Change-Id: I1923a4b6a959d86aa2c5b3d71faaaf2541d3ba85

commit 9b1579b04646e8581482d2b37e8b3d984be7dd75
Author: Steve Loughran <[email protected]>
Date:   2017-04-28T17:26:10Z

    review comments
    
    Change-Id: I6a0b0b9f06a4adcdf55ef75161dc1039961bc7a1

commit 844e2551daad0ecfd1f870c4d3e130e361c454c1
Author: Steve Loughran <[email protected]>
Date:   2017-05-02T13:44:10Z

    SPARK-7481 more proofreading
    
    Change-Id: Ic4804667af8e52b7be11fb00621ad8b69a1d2569

commit 72a03ed58331813b0ad4bc9517fcc1f23a5eda6f
Author: Steve Loughran <[email protected]>
Date:   2017-05-02T18:21:46Z

    SPARK-7481 proofreading docs
    
    Change-Id: I2b75a2722f0082b916b9be20bd23a0bdc2d36615

----


