[GitHub] [spark] HyukjinKwon commented on a change in pull request #28465: [SPARK-5300][CORE][DOCS] Mention lack of RDD order preservation after deserialization

GitBox Thu, 07 May 2020 04:18:01 -0700


HyukjinKwon commented on a change in pull request #28465:
URL: https://github.com/apache/spark/pull/28465#discussion_r421427690




##########
File path: docs/rdd-programming-guide.md
##########
@@ -360,7 +360,7 @@ Some notes on reading files with Spark:
 
 * If using a path on the local filesystem, the file must also be accessible at 
the same path on worker nodes. Either copy the file to all workers or use a 
network-mounted shared file system.
 
-* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`.
+* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`. When multiple files are read, the order of 
elements in the resulting RDD is not guaranteed, as files can be read in any 
order. Within a partition, element order is respected.

Review comment:
       Typically the local file system is not used in production, so it might 
not be a big deal at this point.
   
   > Are you sure spark.sql.files.openCostInBytes and 
spark.sql.files.maxPartitionBytes have any influence on this?
   
   This affects the SQL case - SQL APIs such as `spark.read.csv()` also do not 
guarantee the natural order, and the order can become nondeterministic in the 
middle of an operation such as a shuffle.
   So, the cause is different but the result is similar - nondeterministic order.
   
   This is why I think we should instead have a separate page that explains 
this comprehensively. I might not have to list every API, because this is 
more specific to how Spark works than to how each API works.
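   
   As a rough illustration of the behaviour described in the proposed doc text, 
here is a toy model in plain Python. It is not Spark code: `read_files`, 
`within_file_order_kept`, and `FILES` are hypothetical names, and the model 
assumes one partition per file with the files arriving in an arbitrary order.

```python
import random

# Hypothetical input: three small files, each an ordered list of lines.
FILES = {
    "a.txt": ["a1", "a2", "a3"],
    "b.txt": ["b1", "b2", "b3"],
    "c.txt": ["c1", "c2", "c3"],
}


def read_files(files, seed):
    """Toy stand-in for reading several files: one partition per file.

    The order in which files are visited is arbitrary (modelled here with a
    seeded shuffle), but the lines inside each file keep their order.
    """
    rng = random.Random(seed)
    names = list(files)
    rng.shuffle(names)  # order across files is not guaranteed
    return [(name, line) for name in names for line in files[name]]


def within_file_order_kept(files, elements):
    """Check that each file's lines appear in their original relative order."""
    for name, lines in files.items():
        seen = [line for n, line in elements if n == name]
        if seen != lines:
            return False
    return True


# Several simulated reads: the file order may differ between runs, while the
# per-file (per-partition) order does not.
runs = [read_files(FILES, seed) for seed in range(5)]
for elements in runs:
    print([line for _, line in elements],
          within_file_order_kept(FILES, elements))
```

   The model only mirrors the two properties stated in the diff: overall order 
across files is not guaranteed, and order within a partition is respected. It 
says nothing about how Spark actually schedules file reads.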
   
    






