This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 59d9099  [MINOR][DOCS] Mention lack of RDD order preservation after 
deserialization
59d9099 is described below

commit 59d90997a52f78450fefbc96beba1d731b3678a1
Author: Antonin Delpeuch <anto...@delpeuch.eu>
AuthorDate: Tue May 12 08:27:43 2020 -0500

    [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
    
    ### What changes were proposed in this pull request?
    
    This changes the docs to make it clearer that order preservation is not 
guaranteed when saving an RDD to disk and reading it back 
([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)).
    
    I added two sentences about this in the RDD Programming Guide.
    
    The issue was discussed on the dev mailing list:
    
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
    
    ### Why are the changes needed?
    
    Because RDDs are order-aware collections, it is natural to expect that if I 
use `saveAsTextFile` and then load the resulting files with 
`sparkContext.textFile`, I obtain an RDD with its elements in the same order.

    This is unfortunately not the case at the moment, and there is no 
agreed-upon way to fix this in Spark itself (see PR #4204, which attempted to 
fix this). Users should be aware of this.
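
    The effect can be illustrated with a plain-Python sketch (no Spark
    required) that mimics how `saveAsTextFile` writes zero-padded part files
    and how a reader that lists the directory sees them. The directory layout,
    the three-way split, and the `load` helper are all hypothetical, chosen
    only to make the point:

```python
import os
import tempfile

# A toy "RDD" of ordered elements.
data = [f"line-{i}" for i in range(10)]

# Save it the way saveAsTextFile does: one zero-padded part file per
# partition (part-00000, part-00001, ...). The 3-way split is arbitrary.
tmp = tempfile.mkdtemp()
parts = [data[0:4], data[4:7], data[7:10]]
for i, part in enumerate(parts):
    with open(os.path.join(tmp, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part))

def load(names):
    # Concatenate the part files in the given order, like a naive reader.
    out = []
    for name in names:
        with open(os.path.join(tmp, name)) as f:
            out.extend(f.read().splitlines())
    return out

# os.listdir makes no ordering guarantee, so a reader that takes files in
# filesystem-listing order may or may not reproduce the original sequence.
listed = os.listdir(tmp)

# Sorting the zero-padded names restores the original partition order, so
# the round-trip becomes deterministic.
assert load(sorted(listed)) == data
```

    The same elements always come back (only their order is at risk), which
    is why the new doc sentence is careful to say the partition order depends
    on the order the filesystem returns the files.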
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, two new sentences in the documentation.
    
    ### How was this patch tested?
    
    By checking that the documentation looks good.
    
    Closes #28465 from wetneb/SPARK-5300-docs.
    
    Authored-by: Antonin Delpeuch <anto...@delpeuch.eu>
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/rdd-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index ba99007..70bfefc 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -360,7 +360,7 @@ Some notes on reading files with Spark:
 
 * If using a path on the local filesystem, the file must also be accessible at 
the same path on worker nodes. Either copy the file to all workers or use a 
network-mounted shared file system.
 
-* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`.
+* All of Spark's file-based input methods, including `textFile`, support 
running on directories, compressed files, and wildcards as well. For example, 
you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and 
`textFile("/my/directory/*.gz")`. When multiple files are read, the order of 
the partitions depends on the order the files are returned from the filesystem. 
It may or may not, for example, follow the lexicographic ordering of the files 
by path. Within a partiti [...]
 
 * The `textFile` method also takes an optional second argument for controlling 
the number of partitions of the file. By default, Spark creates one partition 
for each block of the file (blocks being 128MB by default in HDFS), but you can 
also ask for a higher number of partitions by passing a larger value. Note that 
you cannot have fewer partitions than blocks.
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
