Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/17834#discussion_r114652036
--- Diff: docs/cloud-integration.md ---
@@ -0,0 +1,190 @@
+---
+layout: global
+displayTitle: Integration with Cloud Infrastructures
+title: Integration with Cloud Infrastructures
+description: Introduction to cloud storage support in Apache Spark
SPARK_VERSION_SHORT
+---
+<!---
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License. See accompanying LICENSE file.
+-->
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Introduction
+
+
+All major cloud providers offer persistent data storage in *object stores*.
+These are not classic "POSIX" file systems.
+In order to store hundreds of petabytes of data without any single points
of failure,
+object stores replace the classic filesystem directory tree
+with a simpler model of `object-name => data`. To enable remote access,
operations
+on objects are usually offered as (slow) HTTP REST operations.
+
+Spark can read and write data in object stores through filesystem
connectors implemented
+in Hadoop or provided by the infrastructure suppliers themselves.
+These connectors make the object stores look *almost* like filesystems,
with directories and files
+and the classic operations on them such as list, delete and rename.
+
+
+### Important: Cloud Object Stores are Not Real Filesystems
+
+While the stores appear to be filesystems, underneath
+they are still object stores, [and the difference is
significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
+
+They cannot be used as a direct replacement for a cluster filesystem such
as HDFS
+*except where this is explicitly stated*.
+
+Key differences are
+
+* Changes to stored objects may not be immediately visible, both in
directory listings and actual data access.
+* The means by which directories are emulated may make working with them
slow.
+* Rename operations may be very slow and, on failure, leave the store in
an unknown state.
+* Seeking within a file may require new HTTP calls, hurting performance.
+
+How does affect Spark?
+
+1. Reading and writing data can be significantly slower than working with
a normal filesystem.
+1. Some directory structures may be very inefficient to scan during query
split calculation.
+1. The output of work may not be immediately visible to a follow-on query.
+1. The rename-based algorithm by which Spark normally commits work when
saving an RDD, DataFrame or Dataset
+ is potentially both slow and unreliable.
+
+For these reasons, it is not always safe to use an object store as a
direct destination of queries, or as
+an intermediate store in a chain of queries. Consult the documentation of
the object store and its
+connector to determine which uses are considered safe.
+
+### Installation
+
+With the relevant libraries on the classpath and Spark configured with
valid credentials,
+objects can be can be read or written by using their URLs as the path to
data.
+For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")`
will create
+an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
+
+To add the relevant libraries to an application's classpath, include the
`spark-hadoop-cloud`
+module and its dependencies.
+
+In Maven, add the following to the `pom.xml` file, assuming `spark.version`
+is set to the chosen version of Spark:
+
+{% highlight xml %}
+<dependencyManagement>
+ ...
+ <dependency>
+ <groupId>org.apache.spark</groupId>
+ <artifactId>spark-hadoop-cloud_2.11</artifactId>
+ <version>${spark.version}</version>
+ </dependency>
+ ...
+</dependencyManagement>
+{% endhighlight %}
+
+Commercial products based on Apache Spark generally directly set up the
classpath
+for talking to cloud infrastructures, in which case this module may not be
needed.
+
+### Authenticating
+
+Spark jobs must authenticate with the object stores to access data within
them.
+
+1. When Spark is running in a cloud infrastructure, the credentials are
usually automatically set up.
+1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY`
+and `AWS_SESSION_TOKEN` environment variables and sets the associated
authentication options
+for the `s3n` and `s3a` connectors to Amazon S3.
+1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
+1. Authentication details may be manually added to the Spark configuration
in `spark-default.conf`
+1. Alternatively, they can be programmatically set in the `SparkConf`
instance used to configure
+the application's `SparkContext`.
+
+*Important: never check authentication secrets into source code
repositories,
+especially public ones*
+
+Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/)
for the relevant
+configuration and security options.
+
+## Configuring
+
+Each cloud connector has its own set of configuration parameters, again,
+consult the relevant documentation.
+
+### Recommended settings for writing to object stores
+
+Here are some settings to use when writing to object stores.
+
+```
+spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
+spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
+```
+
+This uses the "version 2" algorithm for committing files, which does less
+renaming than the "version 1" algorithm, though as it still uses `rename()`
+to commit files, it may be unsafe to use.
--- End diff --
I'll try to think of a better phrasing, saying "if your object store is
consistent enough use v2 for speed".
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]