[GitHub] [spark] wetneb commented on a change in pull request #28465: [SPARK-5300][CORE][DOCS] Mention lack of RDD order preservation after deserialization

GitBox Thu, 07 May 2020 00:19:01 -0700


wetneb commented on a change in pull request #28465:
URL: https://github.com/apache/spark/pull/28465#discussion_r421289544




##########
File path: docs/rdd-programming-guide.md
##########
@@ -360,7 +360,7 @@ Some notes on reading files with Spark:
 
 * If using a path on the local filesystem, the file must also be accessible at 
the same path on worker nodes. Either copy the file to all workers or use a 
network-mounted shared file system.
 
-* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`.
+* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`. When multiple files are read, the order of 
elements in the resulting RDD is not guaranteed, as files can be read in any 
order. Within a partition, element order is respected.

Review comment:
       Another approach would be to state, in the documentation of each method 
of RDD and SparkContext (and Dataset, SparkSession), whether it preserves 
element order. I would be interested in whether partitioning is preserved too; 
that could be documented in the same way.
   
   Perhaps there could even be annotations on the methods that preserve these 
properties, which could let users implement automated checks for calls to 
methods that do not preserve them.
   
   The problem with writing up a separate page or section about this in the 
docs is that it is likely to go out of sync with the API.
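
   To make the annotation idea concrete, here is a rough sketch of what such 
markers and an automated check might look like. Nothing in it is part of 
Spark's API: the `preserves` decorator, the `ToyRDD` class, and both helper 
functions are hypothetical names made up for illustration, and the markings on 
`filter` and `repartition` are only examples of how methods could be labelled.

```python
import ast


def preserves(order=False, partitioning=False):
    """Hypothetical marker recording what a method preserves (not a Spark API)."""
    def mark(fn):
        fn.preserves_order = order
        fn.preserves_partitioning = partitioning
        return fn
    return mark


class ToyRDD:
    """Stand-in for an RDD-like class; method names are illustrative only."""

    def __init__(self, items):
        self.items = list(items)

    @preserves(order=True, partitioning=True)
    def filter(self, pred):
        return ToyRDD(x for x in self.items if pred(x))

    @preserves(order=False, partitioning=False)
    def repartition(self, n):
        # A real repartition would shuffle data; this toy just copies it.
        return ToyRDD(self.items)


def order_breaking_methods(cls):
    """Names of methods explicitly marked as not preserving element order."""
    return sorted(
        name
        for name, member in vars(cls).items()
        if callable(member) and getattr(member, "preserves_order", None) is False
    )


def find_order_breaking_calls(source, cls):
    """Return (line, method) pairs for calls in `source` to such methods.

    Matching is by method name only, so this is a lint-style heuristic,
    not a type-aware analysis.
    """
    risky = set(order_breaking_methods(cls))
    hits = [
        (node.lineno, node.func.attr)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in risky
    ]
    return sorted(hits)
```

   Because the markers live on the methods themselves, a check like 
`find_order_breaking_calls(user_code, ToyRDD)` reads them directly from the 
class, so it cannot drift out of sync with the API the way a separate docs page 
could.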




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



