spark git commit: [SPARK-18723][DOC] Expanded programming guide information on wholeTex…

srowen Fri, 16 Dec 2016 09:44:23 -0800

Repository: spark
Updated Branches:
  refs/heads/master dc2a4d4ad -> 836c95b10



[SPARK-18723][DOC] Expanded programming guide information on wholeTex…

## What changes were proposed in this pull request?

Add more information about `wholeTextFiles` to the Programming Guide. Also
explain how its partitioning policy differs from that of `textFile` and how
that difference affects performance.

Also add a reference to the underlying CombineFileInputFormat.
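
To illustrate the optional second argument described in the new guide text, here is a minimal sketch. It assumes a local Spark installation; the input directory `data/small-files` and the value `8` are hypothetical and chosen only for illustration. Note that `minPartitions` is a lower-bound hint, so the resulting partition count may differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption for this sketch; a real job would
    // normally receive its master from spark-submit.
    val conf = new SparkConf()
      .setAppName("WholeTextFilesSketch")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    try {
      // Hypothetical directory containing many small text files.
      val inputDir = "data/small-files"

      // Default call: partitioning follows data locality, which may
      // yield only a few partitions.
      val byDefault = sc.wholeTextFiles(inputDir)

      // Optional second argument: suggested minimum number of partitions.
      val spread = sc.wholeTextFiles(inputDir, minPartitions = 8)

      println(s"default partitions: ${byDefault.getNumPartitions}")
      println(s"with minPartitions = 8: ${spread.getNumPartitions}")

      // Each record is a (file path, whole file content) pair.
      spread
        .map { case (path, content) => (path, content.length) }
        .take(5)
        .foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two printed partition counts on your own data is a quick way to check whether the default is too low for the workload.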

## How was this patch tested?

Manual build of documentation and inspection in browser

```
cd docs
jekyll serve --watch
```

Author: Michal Senkyr <[email protected]>

Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/836c95b1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/836c95b1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/836c95b1

Branch: refs/heads/master
Commit: 836c95b108ddd350b10796c97fc30b13371fb0fb
Parents: dc2a4d4
Author: Michal Senkyr <[email protected]>
Authored: Fri Dec 16 17:43:39 2016 +0000
Committer: Sean Owen <[email protected]>
Committed: Fri Dec 16 17:43:39 2016 +0000

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/SparkContext.scala | 4 ++++
 docs/programming-guide.md                               | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/836c95b1/core/src/main/scala/org/apache/spark/SparkContext.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 02c009c..bd3f454 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -851,6 +851,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
    * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
@@ -900,6 +902,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred; very large files may cause bad performance.
    * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.

http://git-wip-us.apache.org/repos/asf/spark/blob/836c95b1/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 353730c..a4017b5 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
 
 Apart from text files, Spark's Scala API also supports several other data formats:
 
-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, `wholeTextFiles` provides an optional second argument for controlling the minimal number of partitions.
 
 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.
 


