GitHub user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/16157#discussion_r90970609
--- Diff: docs/programming-guide.md ---
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
Apart from text files, Spark's Scala API also supports several other data
formats:
-* `SparkContext.wholeTextFiles` lets you read a directory containing
multiple small text files, and returns each of them as (filename, content)
pairs. This is in contrast with `textFile`, which would return one record per
line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing
multiple small text files, and returns each of them as (filename, content)
pairs. This is in contrast with `textFile`, which would return one record per
line in each file. It takes an optional second argument for controlling the
minimal number of partitions (by default this is 2). It uses
[CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html)
internally in order to process large numbers of small files effectively by
grouping files on the same node into a single split. (This can lead to
non-optimal partitioning. It is therefore advisable to set the minimal number
of partitions explicitly.)
--- End diff --
(What does 'node' mean here in the context of Spark: an executor? From
reading the code and docs, I'm also not sure this behavior is guaranteed.)
I don't think the implementation detail matters here as much as the problem
the end user would solve by setting this. You might instead say that the
default can put many small files into relatively few partitions, and that this
is why you might wish to set a minimum. That's roughly what the docs suggest
already; maybe this is better as a tiny improvement to the scaladoc?
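
To make the user-facing point concrete, here is a minimal sketch of the call in question. It assumes Spark is on the classpath and runs with a local master. The object name, the temporary directory, the file names, and the value 4 are all made up for illustration. The second argument is only a suggested minimum, and the resulting partition counts depend on how the files get combined, so the sketch prints them rather than assuming particular values.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of Spark or of this pull request.
object WholeTextFilesSketch {

  // Returns (record count, partitions by default, partitions with a minimum of 4).
  def run(): (Long, Int, Int) = {
    val conf = new SparkConf()
      .setAppName("wholeTextFiles-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative input: eight tiny text files in a temporary directory.
      val dir = Files.createTempDirectory("small-files")
      (1 to 8).foreach { i =>
        Files.write(dir.resolve(s"part-$i.txt"), s"contents of file $i\n".getBytes("UTF-8"))
      }

      // Default: many small files may end up in relatively few partitions.
      val byDefault = sc.wholeTextFiles(dir.toString)

      // Explicit minimum: a hint asking for at least this many partitions.
      val withMin = sc.wholeTextFiles(dir.toString, minPartitions = 4)

      // One (filename, content) record per file, regardless of partitioning.
      (withMin.count(), byDefault.getNumPartitions, withMin.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (records, defaultParts, minParts) = run()
    println(s"records: $records")
    println(s"partitions by default: $defaultParts")
    println(s"partitions with minPartitions = 4: $minParts")
  }
}
```

Comparing the two printed partition counts on real data is probably the quickest way for a user to decide whether setting a minimum is worthwhile.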