HyukjinKwon commented on a change in pull request #28465:
URL: https://github.com/apache/spark/pull/28465#discussion_r421179756
##########
File path: docs/rdd-programming-guide.md
##########
@@ -360,7 +360,7 @@ Some notes on reading files with Spark:
* If using a path on the local filesystem, the file must also be accessible at
the same path on worker nodes. Either copy the file to all workers or use a
network-mounted shared file system.
-* All of Spark's file-based input methods, including `textFile`, support
running on directories, compressed files, and wildcards as well. For example,
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and
`textFile("/my/directory/*.gz")`.
+* All of Spark's file-based input methods, including `textFile`, support
running on directories, compressed files, and wildcards as well. For example,
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and
`textFile("/my/directory/*.gz")`. When multiple files are read, the order of
elements in the resulting RDD is not guaranteed, as files can be read in any
order. Within a partition, element order is respected.
Review comment:
For the RDD case you mentioned (#4204), I think the Hadoop file system
lists files in lexicographical order. So, sure, the order will be kept in
most cases, but it is not fully guaranteed. The internal listing order is
simply inherited from Hadoop's handling.
This isn't specific to `textFile` either. The SQL case is different, as I
described above. It might be best to document this on a separate page.
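
To illustrate the general point that order should be imposed explicitly rather than inherited from a listing, here is a minimal sketch in plain Python. It does not use Spark or Hadoop; `read_lines_in_sorted_order` and `demo` are hypothetical names made up for this example. `os.listdir` returns entries in arbitrary order, so the sketch sorts the names before reading.

```python
import os
import tempfile


def read_lines_in_sorted_order(directory, suffix=".txt"):
    """Collect lines from matching files, visiting files in sorted name order.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    # os.listdir makes no ordering promise, so sort the names explicitly.
    names = sorted(n for n in os.listdir(directory) if n.endswith(suffix))
    lines = []
    for name in names:
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            # Within one file, lines are kept in their original order.
            lines.extend(line.rstrip("\n") for line in f)
    return lines


def demo():
    """Write three small files out of order and read them back sorted."""
    with tempfile.TemporaryDirectory() as d:
        files = [("b.txt", "b1\nb2\n"), ("a.txt", "a1\n"), ("c.txt", "c1\n")]
        for name, body in files:
            with open(os.path.join(d, name), "w", encoding="utf-8") as f:
                f.write(body)
        return read_lines_in_sorted_order(d)
```

The same idea would apply to any reader that depends on a directory listing: if the order matters to the caller, the caller sorts, instead of relying on whatever order the listing happens to return.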