[GitHub] spark pull request #16157: [SPARK-18723][DOC] Expanded programming guide inf...

srowen Sat, 10 Dec 2016 12:16:33 -0800

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16157#discussion_r91841458
  
    --- Diff: docs/programming-guide.md ---
    @@ -347,7 +347,7 @@ Some notes on reading files with Spark:
     
     Apart from text files, Spark's Scala API also supports several other data 
formats:
     
    -* `SparkContext.wholeTextFiles` lets you read a directory containing 
multiple small text files, and returns each of them as (filename, content) 
pairs. This is in contrast with `textFile`, which would return one record per 
line in each file.
    +* `SparkContext.wholeTextFiles` lets you read a directory containing 
multiple small text files, and returns each of them as (filename, content) 
pairs. This is in contrast with `textFile`, which would return one record per 
line in each file. It takes an optional second argument for controlling the 
minimal number of partitions (by default this is 2). It uses 
[CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html)
 internally in order to process large numbers of small files effectively by 
grouping files on the same executor into a single partition. This can lead to 
sub-optimal partitioning when the file sets would benefit from residing in 
multiple partitions (e.g., larger partitions would not fit in memory, files are 
replicated but a large subset is locally reachable from a single executor, 
subsequent transformations would benefit from multi-core processing). In those 
cases, set the `minPartitions` argument to enforce splitting.
    --- End diff --
    
    No, the difference is more fundamental. `textFile` returns the lines of a 
file, and `wholeTextFiles` returns the entire contents of each file. It is not a 
difference of partitioning. It's a different operation altogether.
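
    To make the distinction concrete, here is a minimal sketch in plain Python. It does not use Spark. `read_lines` and `read_whole_files` are hypothetical stand-ins that only mimic the record shapes of `textFile` and `wholeTextFiles`, and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_lines(directory):
    """Hypothetical stand-in for textFile: one record per line, across all files."""
    records = []
    for path in sorted(Path(directory).iterdir()):
        records.extend(path.read_text().splitlines())
    return records


def read_whole_files(directory):
    """Hypothetical stand-in for wholeTextFiles: one (filename, content) record per file.

    Assumption: only the base name is used as the key here, to keep the
    output independent of the temporary directory.
    """
    return [(path.name, path.read_text()) for path in sorted(Path(directory).iterdir())]


def demo():
    # Two small files: three lines in total.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("one\ntwo\n")
        Path(tmp, "b.txt").write_text("three\n")
        return read_lines(tmp), read_whole_files(tmp)


line_records, file_records = demo()
print(len(line_records))  # one record per line
print(len(file_records))  # one record per file
```

    The same input yields a different number and shape of records in each case, and that holds regardless of how the records are partitioned.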
    
    I think you'd generally let data locality define partitions, so I don't 
agree that the partitioning determined by CombineFileInputFormat is "usually" 
inappropriate. You can of course change the partitioning if it happens to be.
    
    As such, if you want to update the docs, I would stick to giving 
information that is actionable to the caller: if you end up with too few, 
overly large partitions, increase `minPartitions`. That's fairly well understood 
anyway, but there is no harm in noting it. Everything else doesn't seem like the 
right guidance.



