Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/12004#discussion_r113967945
--- Diff: docs/cloud-integration.md ---
@@ -0,0 +1,512 @@
+---
+layout: global
+displayTitle: Integration with Cloud Infrastructures
+title: Integration with Cloud Infrastructures
+description: Introduction to cloud storage support in Apache Spark
SPARK_VERSION_SHORT
+---
+<!---
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License. See accompanying LICENSE file.
+-->
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## <a name="introduction"></a>Introduction
+
+
+All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google
GCS and others offer
+persistent data storage systems, "object stores". These are not quite the
same as classic file
+systems: in order to scale to hundreds of Petabytes, without any single
points of failure
+or size limits, object stores, "blobstores", have a simpler model of `name
=> data`.
+
+Apache Spark can read or write data in object stores for data access.
+through filesystem connectors implemented in Apache Hadoop or provided by
third-parties.
+These libraries make the object stores look *almost* like filesystems,
with directories and
+operations on files (rename) and directories (create, rename, delete)
which mimic
+those of a classic filesystem. Because of this, Spark and Spark-based
applications
+can work with object stores, generally treating them as as if they were
slower-but-larger filesystems.
+
+With these connectors, Apache Spark supports object stores as the source
+of data for analysis, including Spark Streaming and DataFrames.
+
+
+## <a name="quick_start"></a>Quick Start
+
+Provided the relevant libraries are on the classpath, and Spark is
configured with your credentials,
+objects in an object store can be can be read or written through URLs
which uses the name of the
+object store client as the schema and the bucket/container as the hostname.
+
+
+### Dependencies
+
+The Spark application neeeds the relevant Hadoop libraries, which can
+be done by including the `spark-hadoop-cloud` module for the specific
version of spark used.
+
+The Spark application should include <code>hadoop-openstack</code>
dependency, which can
+be done by including the `spark-hadoop-cloud` module for the specific
version of spark used.
+For example, for Maven support, add the following to the
<code>pom.xml</code> file:
+
+{% highlight xml %}
+<dependencyManagement>
+ ...
+ <dependency>
+ <groupId>org.apache.spark</groupId>
+ <artifactId>spark-hadoop-cloud_2.11</artifactId>
+ <version>${spark.version}</version>
+ </dependency>
+ ...
+</dependencyManagement>
+{% endhighlight %}
+
+If using the Scala 2.10-compatible version of Spark, the artifact is of
course `spark-hadoop-cloud_2.10`.
+
+### Basic Use
+
+You can refer to data in an object store just as you would data in a
filesystem, by
+using a URL to the data in methods like `SparkContext.textFile()` to read
data,
+`saveAsTextFile()` to write it back.
+
+
+Because object stores are viewed by Spark as filesystems, object stores can
+be used as the source or destination of any spark work âbe it batch,
SQL, DataFrame,
+Streaming or something else.
+
+The steps to do so are as follows
+
+1. Use the full URI to refer to a bucket, including the prefix for the
client-side library
+to use. Example: `s3a://landsat-pds/scene_list.gz`
+1. Have the Spark context configured with the authentication details of
the object store.
+In a YARN cluster, this may also be done in the `core-site.xml` file.
+
+
+## <a name="output"></a>Object Stores as a substitute for HDFS
+
+As the examples show, you can write data to object stores. However, that
does not mean
+That they can be used as replacements for a cluster-wide filesystem.
+
+The full details are covered in [Cloud Object Stores are Not Real
Filesystems](#cloud_stores_are_not_filesystems).
+
+The brief summary is:
+
+| Object Store Connector | Replace HDFS? |
+|-----------------------------|--------------------|
+| `s3a://` `s3n://` from the ASF | No |
+| Amazon EMR `s3://` | Yes |
+| Microsoft Azure `wasb://` | Yes |
+| OpenStack `swift://` | No |
+
+It is possible to use any of the object stores as a destination of work,
i.e. use
+`saveAsTextFile()` or `save()` to save data there, but the commit process
may be slow
+and, unreliable in the presence of failures.
+
+It is faster and safer to use the cluster filesystem as the destination of
Spark jobs,
--- End diff --
I'm going to change it to "consult the docs of the connector and the object
store". That allows for things to change over time without the spark docs
changing, and making it clear to the reader that this may be very dangerous.
It is a problem to use the result of a job because the commit process uses
the object store as the the location for uncommitted work. Both the task and
job commits rename() their output; task attempt output promoted to job, job to
dest, all of which are done by list + copy + delete. If you don't get that
listing right, you don't copy everything. task output doesn't get promoted to
job output *and nothing even notices*. Or, if there is only one output, the
listing of the parent directory returning 404, as there is no evidence a
directory exists yet. At least there the job fails; look at the final stack
trace HADOOP-11487 for that surfacing in Spark SQL.
*The more I know about the standard commit algorithm, the S3 consistency
model & what S3N/S3A do, the more surprised I am that it has ever worked with
S3.*
The Netflix staging committer addresses this by using HDFS to manage
(consistently) all the data about the ongoing job: when each task commits it
does an uncommitted multipart put to the final destination dir, saving all the
pending commit metadata data to a file in HDFS, relying on the usual v1 commit
algorithm to commit/abort that until the final job commit. It then reads in all
the successfully promised summary files, completes their writes by a single
POST each. This means we can avoid both the rename and a need for consistency
on the dest FS, at least for a single query. Chaining the queries still needs
consistency (S3mper, etc), or a least a long enough delay for the changes of
the first query to propagate.
regarding distcp: you do the work locally and then upload. As long as
DistCp doesn't try to list/manipulate the uploaded file, no issues.
HADOOP-13145 stops it doing that: before it went in, problems did surface.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]