GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/21066
[SPARK-23977][CLOUD][Wip] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

## What changes were proposed in this pull request?

This patch has SPARK-23807 as a prerequisite; this PR initially includes it so that it builds and tests independently.

* Add a source tree under `hadoop-cloud` which builds only if the `hadoop-3.1` profile is enabled.
* Add a subclass of `HadoopMapReduceCommitProtocol`, `org.apache.spark.internal.io.cloud.PathOutputCommitProtocol`, which uses the Hadoop 3.1 `PathOutputCommitterFactory` to create the committers.
* Add an `org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter` class which extends `ParquetOutputCommitter` to wire up Parquet output even when code requires the committer to be a `ParquetOutputCommitter`.

If the application configures the Spark context to use the new committers, then jobs will switch to the factory mechanism and so pick up the configured committer for the destination filesystem, with `FileOutputCommitter` being the standard default.

If/when Spark switches to the `org.apache.hadoop.mapreduce.lib.output` output classes, it would get the factory binding automatically, except for Parquet, whose committer subclassing always complicates things. A binding class will always be needed there.

## How was this patch tested?

Automated tests in [cloud-examples](https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples) exercise the commit mechanism with CSV, ORC and Parquet output. Tests include:

* Copies of some of the tests in `org.apache.spark.sql.sources.HadoopFsRelationTest`, reworked to set up, probe and tear down through `FileSystem` instances rather than directly against the local FS.
* Full functional tests reading in public datasources, transforming them and saving them using different committer options.
* The fault injection feature of Hadoop 3.1's S3A connector, which can simulate S3 listing inconsistency at a higher rate than would normally be seen; this will force eventual-consistency related bugs to surface.

The S3A committers take advantage of the fact that writing a file is always atomic, and save summary data as JSON in the `_SUCCESS` file created in jobs. The tests use this to verify the correct committer was used in the different tests.

The Hadoop 3/hive version check problem fails all the tests unless one of the following fixes is applied.

* hadoop-3.x built locally with a false published version of 2.x (built with `-Ddeclared.hadoop.version=2.11` or similar)
* spark built with a modified org.sparkproject.hive JAR.

I've done both.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-23977-pathoutputcommitters

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21066

----
commit 29e73242cba9797ed24127b24bb0380c69a608d3
Author: Steve Loughran <stevel@...>
Date:   2018-03-28T17:38:57Z

    SPARK-23807 Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

    Change-Id: Ia4526f184ced9eef5b67aee9e91eced0dd38d723

commit 016d69090691631343d37f9704d0f37a84ddf297
Author: Steve Loughran <stevel@...>
Date:   2018-03-29T15:04:02Z

    SPARK-23807 review set 1:
    * hadoop branch-2 dependencies always declared
    * minor nits in POM addressed
    * added log4j.properties for tests

    Change-Id: Ibb64b20a0be8624d1709e592b9fe85bdc4dd1af7

commit 942365763f90260e671629b519ce3dbbf7e5455e
Author: Steve Loughran <stevel@...>
Date:   2018-04-03T13:36:52Z

    SPARK-23807 move new hadoop-cloud source out to new PR; this contains the build with all the POM changes other than those adding the optional
    hadoop-3.02+ source tree to the spark-hadoop-cloud build

    Change-Id: Iccc2b66602db05db132ce5cf5c8546fe9a13a3fa

commit 58c04e92da4f394b4983e48981f32040e92600e0
Author: Steve Loughran <stevel@...>
Date:   2018-04-03T14:16:45Z

    HADOOP-13207 and switch to the RC hadoop 3.1

    Change-Id: Ic13caf5fcf96d617085051579ede8380b2106119

commit 41845269f950a57968c473f90233d30b77a905dc
Author: Steve Loughran <stevel@...>
Date:   2018-04-05T14:08:35Z

    SPARK-23807 add the dependencies for the hadoop 3 profile.

    This includes the profile in test-dependencies.sh, so this part of the build will work: hive doesn't need to be working to build that dependency graph.

    Change-Id: I1ecfd4b1a8bea26600765b1de59f2425c42f6b03

commit 7c93d98aae8d74e0f0606cb03e68b0ac94bde177
Author: Steve Loughran <stevel@...>
Date:   2018-04-05T17:52:26Z

    remove hadoop-3 as a profile to do a dependency check on, as hadoop 3.1 is still in staging

    Change-Id: Id2d5655088b2a8c2bdec43f7d17110a513be3f7c

commit 036d92a0973276d9e583a3e6df58b60c2e5a64ad
Author: Steve Loughran <stevel@...>
Date:   2018-04-09T12:31:55Z

    Revert "remove hadoop-3 as a profile to do a dependency check on, as hadoop 3.1 is still in staging"

    This reverts commit 7c93d98aae8d74e0f0606cb03e68b0ac94bde177.

commit 52a8c28c564f669aa2cb2998b471f6085fb0742b
Author: Steve Loughran <stevel@...>
Date:   2018-04-09T13:13:29Z

    SPARK-23807 Hadoop 3.1.0 is shipping: profile => "hadoop-3.1" and test-dependencies.sh knows about it

    Change-Id: Ie4906e2f41e9992e803674dce283f03b4dbab67e

commit f6b9dc83d56c20d887166ddba7a7b876a57d65cb
Author: Steve Loughran <stevel@...>
Date:   2018-04-12T19:22:24Z

    SPARK-23807 unshaded jetty dependency fixup needed for Azure wasb://

    jetty-util and jetty-util-ajax are forced into the dist/jars directory by explicit identification in the relevant POMs as in the hadoop-dist-scope. Without this they weren't coming in as spark-assembly was seeing jetty-util marked as provided.
    It's not needed for the spark-* JARs, which all use the shaded reference, but it is needed indirectly via hadoop-azure.

    This change to the poms reinstates it. Maven has proven surprisingly "fussy" here; the implication being its "closest declaration wins" resolution policy doesn't just control versions, it has influence over scoping.

    Change-Id: I081023cae84236c925fad4e94168f1dac5a8026a

commit 3da1f3faa6601d38deb259203f2f48b17293f51d
Author: Steve Loughran <stevel@...>
Date:   2018-04-13T12:47:20Z

    SPARK-23977 Add committer binding to Hadoop 3.1 PathOutputCommitter Mechanism

    Change-Id: I66d249eb3a3ffe6ab0a7059aed174623072a27b6

----

---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
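
For readers who want to try the binding described in the pull request, switching a job to the new committers should come down to a handful of configuration options. The sketch below collects them in a plain Python dictionary so it runs without Spark installed. The two `org.apache.spark.internal.io.cloud` class names come from the PR description. Everything else is an assumption drawn from the Spark and Hadoop 3.1 S3A committer documentation rather than from the PR: the option keys, the `S3ACommitterFactory` class and the `directory` committer name. Check them against the versions in use. The helper name `committer_settings` is hypothetical.

```python
from typing import Dict

# Class names taken from the PR description.
COMMIT_PROTOCOL = "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
PARQUET_COMMITTER = "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"

# Assumption: the S3A committer factory shipped with Hadoop 3.1 (not named in the PR).
S3A_FACTORY = "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"


def committer_settings(scheme: str = "s3a", committer: str = "directory") -> Dict[str, str]:
    """Hypothetical helper: options that route job commits through the factory mechanism."""
    settings = {
        # Assumed Spark SQL option keys; verify against the Spark release in use.
        "spark.sql.sources.commitProtocolClass": COMMIT_PROTOCOL,
        "spark.sql.parquet.output.committer.class": PARQUET_COMMITTER,
    }
    if scheme == "s3a":
        # Assumed Hadoop option keys, passed through Spark's "spark.hadoop." prefix.
        settings["spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a"] = S3A_FACTORY
        settings["spark.hadoop.fs.s3a.committer.name"] = committer
    # Other schemes get no factory entry here, so the factory mechanism
    # should fall back to FileOutputCommitter, the default named in the PR.
    return settings


if __name__ == "__main__":
    for key, value in sorted(committer_settings().items()):
        print(f"{key}={value}")
```

With PySpark, each pair could be passed to `SparkSession.builder.config(key, value)` before the session is created; on the command line, each would become a `--conf key=value` argument to `spark-submit`.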