[GitHub] spark pull request #21066: [SPARK-23977][CLOUD][Wip] Add commit protocol bin...

steveloughran Fri, 13 Apr 2018 06:17:00 -0700

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/21066


    [SPARK-23977][CLOUD][Wip] Add commit protocol binding to Hadoop 3.1 
PathOutputCommitter mechanism

    ## What changes were proposed in this pull request?
    
    This patch has SPARK-23807 as a prerequisite; this PR initially includes 
it so that it builds and tests independently.
    
    * Add source tree under `hadoop-cloud` which builds iff the `hadoop-3.1` 
profile is enabled.
    * Add a subclass of `HadoopMapReduceCommitProtocol`, 
`org.apache.spark.internal.io.cloud.PathOutputCommitProtocol`, which uses the 
Hadoop 3.1 `PathOutputCommitterFactory` to create the committers.
    * Add a `org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter` 
class which extends `ParquetOutputCommitter` to wire up Parquet output even 
when code requires the committer to be a `ParquetOutputCommitter`.
    
    If the application configures the Spark context to use the new committers, 
then jobs will switch to the factory mechanism and so pick up the configured 
committer for the destination filesystem, with `FileOutputCommitter` being the 
standard default.
    If/when Spark switches to `org.apache.hadoop.mapreduce.lib.output` output 
classes it would get the factory binding automatically, except for Parquet, 
whose committer subclassing always complicates things. A binding class will 
always be needed there.
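
As a rough illustration of what "configures the Spark context to use the new committers" could look like, here is a minimal sketch that assembles the options as `--conf` arguments. The two class names come from this PR; the option names `spark.sql.sources.commitProtocolClass`, `spark.sql.parquet.output.committer.class` and `spark.hadoop.fs.s3a.committer.name` are assumptions taken from Spark's SQL configuration and the Hadoop 3.1 S3A connector, not something this patch states, and the helper names are hypothetical.

```python
# Class names as given in this PR.
COMMIT_PROTOCOL = "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
PARQUET_COMMITTER = (
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
)


def committer_options(s3a_committer=None):
    """Build the Spark options that select the new commit protocol.

    The option names are assumptions (see the text above); only the
    class names are taken from the PR itself.
    """
    options = {
        "spark.sql.sources.commitProtocolClass": COMMIT_PROTOCOL,
        "spark.sql.parquet.output.committer.class": PARQUET_COMMITTER,
    }
    if s3a_committer is not None:
        # Optional: name the S3A committer the factory should create.
        options["spark.hadoop.fs.s3a.committer.name"] = s3a_committer
    return options


def as_submit_args(options):
    """Flatten the options into a list of --conf key=value arguments."""
    args = []
    for key in sorted(options):
        args += ["--conf", f"{key}={options[key]}"]
    return args
```

Leaving `s3a_committer` unset keeps whatever committer the destination filesystem is configured with, which matches the "standard default" behaviour described above.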
    
    ## How was this patch tested?
    
    Automated tests in 
[cloud-examples](https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples)
 exercise the commit mechanism with CSV, ORC and Parquet output.
    
    Tests include:
    
    * Copies of some of the tests in 
`org.apache.spark.sql.sources.HadoopFsRelationTest`, reworked to set up, 
probe and tear down through FileSystem instances rather than the local FS directly.
    * Full functional tests reading in public data sources, transforming and 
saving them using different committer options.
    * The fault injection feature of Hadoop 3.1's S3A connector, which can 
simulate S3 listing inconsistency at a higher rate than would normally be seen; 
this forces eventual-consistency-related bugs to surface.
    
    The S3A committers take advantage of the fact that writing a file is always 
atomic, and save summary data as JSON in the `_SUCCESS` file created by each job. 
The tests use this summary to verify that the correct committer was used in each 
test.
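
The `_SUCCESS` check can be sketched roughly as follows. This is not the code in cloud-examples, only an outline of the idea: it assumes the JSON summary records the committer under a `committer` field and that an empty `_SUCCESS` file means no summary was written (as with the classic `FileOutputCommitter`). The function names are hypothetical.

```python
import json
from typing import Optional


def committer_from_success(data: bytes) -> Optional[str]:
    """Return the committer name recorded in the contents of a _SUCCESS file.

    Returns None for an empty marker file (no summary written).
    Raises ValueError if the contents are not a JSON object.
    """
    if not data.strip():
        return None
    summary = json.loads(data.decode("utf-8"))
    if not isinstance(summary, dict):
        raise ValueError("_SUCCESS contents are not a JSON object")
    # "committer" is an assumed field name for the summary data.
    return summary.get("committer")


def assert_committer(data: bytes, expected: str) -> None:
    """Fail if the job summary does not name the expected committer."""
    actual = committer_from_success(data)
    if actual != expected:
        raise AssertionError(
            f"expected committer {expected!r}, found {actual!r}"
        )
```

A test would read the bytes of `_SUCCESS` from the destination directory through whatever filesystem client it already uses and pass them to `assert_committer`.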
    
    The Hadoop 3/Hive version check problem causes all the tests to fail unless 
one of the following fixes is applied:
    * Hadoop 3.x built locally with a false published version of 2.x (built 
with `-Ddeclared.hadoop.version=2.11` or similar)
    * Spark built with a modified org.sparkproject.hive JAR.
    
    I've done both.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark 
cloud/SPARK-23977-pathoutputcommitters

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21066
    
----
commit 29e73242cba9797ed24127b24bb0380c69a608d3
Author: Steve Loughran <stevel@...>
Date:   2018-03-28T17:38:57Z

    SPARK-23807 Add Hadoop 3 profile with relevant POM fix ups, cloud-storage 
artifacts and binding
    
    Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723

commit 016d69090691631343d37f9704d0f37a84ddf297
Author: Steve Loughran <stevel@...>
Date:   2018-03-29T15:04:02Z

    SPARK-23807 review set 1:
    * hadoop branch-2 dependencies always declared
    * minor nits in POM addressed
    * added log4j.properties for tests
    
    Change-Id: Ibb64b20a0be8624d1709e592b9fe85bdc4dd1af7

commit 942365763f90260e671629b519ce3dbbf7e5455e
Author: Steve Loughran <stevel@...>
Date:   2018-04-03T13:36:52Z

    SPARK-23807 move new hadoop-cloud source out to new PR; this contains the 
build with all the POM changes other than those adding the optional 
hadoop-3.02+ source tree to the spark-hadoop-cloud build
    
    Change-Id: Iccc2b66602db05db132ce5cf5c8546fe9a13a3fa

commit 58c04e92da4f394b4983e48981f32040e92600e0
Author: Steve Loughran <stevel@...>
Date:   2018-04-03T14:16:45Z

    HADOOP-13207 and switch to the RC hadoop 3.1
    
    Change-Id: Ic13caf5fcf96d617085051579ede8380b2106119

commit 41845269f950a57968c473f90233d30b77a905dc
Author: Steve Loughran <stevel@...>
Date:   2018-04-05T14:08:35Z

    SPARK-23807 add the dependencies for the hadoop 3 profile.
    
    This  includes the profile in test-dependencies.sh, so this part of the 
build will work: hive doesn't need to be working to build that dependency graph.
    
    Change-Id: I1ecfd4b1a8bea26600765b1de59f2425c42f6b03

commit 7c93d98aae8d74e0f0606cb03e68b0ac94bde177
Author: Steve Loughran <stevel@...>
Date:   2018-04-05T17:52:26Z

    remove hadoop-3 as a profile to do a dependency check on, as hadoop 3.1 is 
still in staging
    
    Change-Id: Id2d5655088b2a8c2bdec43f7d17110a513be3f7c

commit 036d92a0973276d9e583a3e6df58b60c2e5a64ad
Author: Steve Loughran <stevel@...>
Date:   2018-04-09T12:31:55Z

    Revert "remove hadoop-3 as a profile to do a dependency check on, as hadoop 
3.1 is still in staging"
    
    This reverts commit 7c93d98aae8d74e0f0606cb03e68b0ac94bde177.

commit 52a8c28c564f669aa2cb2998b471f6085fb0742b
Author: Steve Loughran <stevel@...>
Date:   2018-04-09T13:13:29Z

    SPARK-23807 Hadoop 3.1.0 is shipping: profile => "hadoop-3.1" and 
test-dependencies.sh knows about it
    
    Change-Id: Ie4906e2f41e9992e803674dce283f03b4dbab67e

commit f6b9dc83d56c20d887166ddba7a7b876a57d65cb
Author: Steve Loughran <stevel@...>
Date:   2018-04-12T19:22:24Z

    SPARK-23807 unshaded jetty dependency fixup needed for Azure wasb://
    
    jetty-util and jetty-util-ajax are forced into the dist/jars directory by
    explicit identification in the relevant POMs as in the hadoop-dist-scope.
    
    Without this they weren't coming in as spark-assembly was seeing jetty-util 
marked
    as provided. It's not needed for the spark-* JARs, which all use the shaded 
reference,
    but it is needed indirectly via hadoop-azure. This change to the poms 
reinstates it.
    
    Maven has proven surprisingly "fussy" here; the implication being its 
"closest declaration wins"
    resolution policy doesn't just control versions, it has influence over 
scoping.
    
    Change-Id: I081023cae84236c925fad4e94168f1dac5a8026a

commit 3da1f3faa6601d38deb259203f2f48b17293f51d
Author: Steve Loughran <stevel@...>
Date:   2018-04-13T12:47:20Z

    SPARK-23977 Add committer binding to Hadoop 3.1 PathOutputCommitter 
Mechanism
    
    Change-Id: I66d249eb3a3ffe6ab0a7059aed174623072a27b6

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
