steveloughran commented on a change in pull request #24970: [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2]
URL: https://github.com/apache/spark/pull/24970#discussion_r310202911
 
 

 ##########
 File path: hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/package.scala
 ##########
 @@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.internal.io
+
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Package object to assist in switching to the Hadoop 3
+ * [[org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory]] factory
+ * mechanism for dynamically loading committers for the destination stores.
+ *
+ * = Using Alternative Committers with Spark and Hadoop 3 =
+ *
+ * Hadoop 3.1 adds a means to select a different output committer when writing
+ * data to object stores. This can provide higher performance as well as
+ * address the consistency and atomicity problems encountered on some filesystems.
+ *
+ * Every object store can implement its own committer factory: the factory
+ * itself will then instantiate the committer of its choice.
+ *
+ * == Prerequisites ==
+ *
+ * The new APIs are in Hadoop 3.0.2, but the S3A committers are only
+ * found in Hadoop 3.1+.
+ *
+ * The Hadoop cluster needs to be configured for the binding from filesystem scheme
+ * to factory. In Hadoop 3.1+ this is done automatically for s3a in the file
+ * `mapred-default.xml`.
+ * Other stores' committers may need to be explicitly declared.
+ *
+ * {{{
+ *   <property>
+ *     <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
+ *     <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
+ *     <description>
+ *       The committer factory to use when writing data to S3A filesystems.
+ *       If mapreduce.outputcommitter.factory.class is set, it will
+ *       override this property.
+ *     </description>
+ *   </property>
+ * }}}
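+ *
+ * As an illustrative sketch only (this mechanism is not added by this patch),
+ * the same scheme-to-factory binding can also be supplied through Spark's
+ * existing `spark.hadoop.` option prefix, whose entries are copied into the
+ * Hadoop configuration used by the job:
+ *
+ * {{{
+ *   import org.apache.spark.SparkConf
+ *
+ *   // Declare the S3A committer factory on the Spark configuration;
+ *   // the "spark.hadoop." prefix propagates it to the Hadoop Configuration.
+ *   val sparkConf = new SparkConf()
+ *     .set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
+ *       "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
+ * }}}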
+ *
+ * == Binding a Spark Context to use the new committers for a store ==
+ *
+ * Spark uses the Hadoop committers in
+ * [[org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand]]
+ * by instantiating and then invoking an instance of
+ * [[org.apache.spark.internal.io.HadoopMapReduceCommitProtocol]].
+ * `InsertIntoHadoopFsRelationCommand` needs to be configured to use
+ * [[org.apache.spark.internal.io.cloud.PathOutputCommitProtocol]] as
+ * its commit protocol. This protocol instantiates the committer through
+ * the factory mechanism and relays operations to it.
+ *
+ * When working with Parquet data, you need to explicitly switch
+ * the Parquet committers to use the same mechanism.
+ *
+ * Consult the cloud-integration document for the specific details on how to
+ * do this.
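+ *
+ * A sketch only, assuming the standard `spark.sql.sources.commitProtocolClass`
+ * and `spark.sql.parquet.output.committer.class` options (neither is defined
+ * in this file); the cloud-integration documentation remains authoritative:
+ *
+ * {{{
+ *   import org.apache.spark.sql.SparkSession
+ *
+ *   val spark = SparkSession.builder()
+ *     // route file-format writes through the path-output commit protocol
+ *     .config("spark.sql.sources.commitProtocolClass",
+ *       "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
+ *     // bind Parquet output to the committer which delegates to the same factory
+ *     .config("spark.sql.parquet.output.committer.class",
+ *       "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
+ *     .getOrCreate()
+ * }}}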
+ *
+ */
+package object cloud {
+
+  /**
+   * The classname to use when referring to the path output committer.
+   */
+  val PATH_COMMIT_PROTOCOL_CLASSNAME: String = classOf[PathOutputCommitProtocol].getName
+
+  /**
+   * The name of the parquet committer.
+   */
+  val PARQUET_COMMITTER_CLASSNAME: String = classOf[BindingParquetOutputCommitter].getName
+
+  /**
+   * Options for committer setup.
+   * When applied to a Spark configuration, this will set the
+   * DataFrame output to use the factory mechanism for writing data for
+   * all file formats.
+   */
+  val COMMITTER_BINDING_OPTIONS: Map[String, String] = Map(
 
 Review comment:
   ok. There's enough in the docs, I'll cut it
