Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114652036
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,190 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single point of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
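    +
    +As an illustration (not taken from the Spark codebase), the sketch below uses the Hadoop `FileSystem` API,
    +which these connectors implement, to list and rename objects under an `s3a://` path. The bucket and paths
    +are placeholders, and the s3a connector and its credentials are assumed to be configured already.
    +
    +{% highlight scala %}
    +import java.net.URI
    +import org.apache.hadoop.fs.{FileSystem, Path}
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder().appName("connector-example").getOrCreate()
    +
    +// Obtain the connector for a (hypothetical) bucket through the Hadoop FileSystem API.
    +val fs = FileSystem.get(new URI("s3a://example-bucket/"), spark.sparkContext.hadoopConfiguration)
    +
    +// The classic operations are available, but each one is emulated over HTTP requests to the store.
    +fs.listStatus(new Path("s3a://example-bucket/data/")).foreach(status => println(status.getPath))
    +fs.rename(new Path("s3a://example-bucket/data/in-progress"), new Path("s3a://example-bucket/data/complete"))
    +{% endhighlight %}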
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    +
    +* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
    +* The means by which directories are emulated may make working with them slow.
    +* Rename operations may be very slow and, on failure, leave the store in an unknown state.
    +* Seeking within a file may require new HTTP calls, hurting performance. 
    +
    +How does this affect Spark?
    +
    +1. Reading and writing data can be significantly slower than working with a normal filesystem.
    +1. Some directory structures may be very inefficient to scan during query split calculation.
    +1. The output of work may not be immediately visible to a follow-on query.
    +1. The rename-based algorithm by which Spark normally commits work when saving an RDD, DataFrame or Dataset
    + is potentially both slow and unreliable.
    +
    +For these reasons, it is not always safe to use an object store as a direct destination of queries, or as
    +an intermediate store in a chain of queries. Consult the documentation of the object store and its
    +connector to determine which uses are considered safe.
    +
    +### Installation
    +
    +With the relevant libraries on the classpath and Spark configured with valid credentials,
    +objects can be read or written by using their URLs as the path to data.
    +For example, `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
    +an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
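    +
    +As a short sketch of this (assuming the s3a connector and credentials are already set up, and that the
    +object is a CSV file with a header row), the same data can also be read through the DataFrame API:
    +
    +{% highlight scala %}
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder().appName("s3a-read-example").getOrCreate()
    +
    +// RDD API: read the gzipped object line by line.
    +val sceneLines = spark.sparkContext.textFile("s3a://landsat-pds/scene_list.gz")
    +println(s"lines: ${sceneLines.count()}")
    +
    +// DataFrame API: read the same object as CSV; the options shown are illustrative only.
    +val scenes = spark.read.option("header", "true").csv("s3a://landsat-pds/scene_list.gz")
    +scenes.printSchema()
    +{% endhighlight %}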
    +
    +To add the relevant libraries to an application's classpath, include the `spark-hadoop-cloud`
    +module and its dependencies.
    +
    +In Maven, add the following to the `pom.xml` file, assuming `spark.version`
    +is set to the chosen version of Spark:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
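    +
    +For an sbt build, a rough (unverified) equivalent, with the version substituted for the chosen Spark
    +release, would be:
    +
    +{% highlight scala %}
    +// build.sbt sketch: pulls in the cloud connector libraries and their dependencies.
    +val sparkVersion = "2.x.y"  // the chosen version of Spark
    +libraryDependencies += "org.apache.spark" % "spark-hadoop-cloud_2.11" % sparkVersion
    +{% endhighlight %}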
    +
    +Commercial products based on Apache Spark generally set up the classpath for talking to cloud
    +infrastructures directly, in which case this module may not be needed.
    +
    +### Authenticating
    +
    +Spark jobs must authenticate with the object stores to access data within them.
    +
    +1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
    +1. `spark-submit` reads the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
    +and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options
    +for the `s3n` and `s3a` connectors to Amazon S3.
    +1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
    +1. Authentication details may be manually added to the Spark configuration in `spark-defaults.conf`.
    +1. Alternatively, they can be programmatically set in the `SparkConf` instance used to configure
    +the application's `SparkContext`.
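    +
    +As a sketch of that last option, the s3a connector's credentials can be passed through Spark's
    +`spark.hadoop.`-prefixed options; the option names follow the Hadoop S3A documentation, the values here
    +are read from the environment purely as placeholders, and real deployments should prefer a proper
    +credential provider.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +
    +// Placeholder wiring only: never hard-code real secrets in application code.
    +val conf = new SparkConf()
    +  .setAppName("s3a-auth-example")
    +  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    +  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    +
    +val spark = SparkSession.builder().config(conf).getOrCreate()
    +{% endhighlight %}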
    +
    +*Important: never check authentication secrets into source code repositories,
    +especially public ones.*
    +
    +Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/) for the relevant
    +configuration and security options.
    +
    +## Configuring
    +
    +Each cloud connector has its own set of configuration parameters; again,
    +consult the relevant documentation.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are some settings to use when writing to object stores. 
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +This uses the "version 2" algorithm for committing files, which does less
    +renaming than the "version 1" algorithm, though as it still uses `rename()`
    +to commit files, it may be unsafe to use.
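    +
    +A sketch of applying the same settings programmatically, for a single application rather than in
    +`spark-defaults.conf`, would be:
    +
    +{% highlight scala %}
    +import org.apache.spark.sql.SparkSession
    +
    +// Apply the committer settings for this session only; suitability still depends on the store's consistency.
    +val spark = SparkSession.builder()
    +  .appName("object-store-output")
    +  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    +  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
    +  .getOrCreate()
    +{% endhighlight %}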
    --- End diff --
    
    I'll try to think of a better phrasing, saying "if your object store is consistent enough use v2 for speed".

