[GitHub] [spark] wetneb opened a new pull request #28465: [SPARK-5300][CORE][DOCS] Mention lack of RDD order preservation after deserialization

GitBox Wed, 06 May 2020 09:43:03 -0700


wetneb opened a new pull request #28465:
URL: https://github.com/apache/spark/pull/28465

### What changes were proposed in this pull request?

This changes the docs to make it clearer that order preservation is not
guaranteed when saving an RDD to disk and reading it back
([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)).

I added two sentences about this in the RDD Programming Guide.

The issue was discussed on the dev mailing list:

http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

### Why are the changes needed?

Because RDDs are order-aware collections, it is natural to expect that if I
use `saveAsTextFile` and then load the resulting file with
`sparkContext.textFile`, I obtain a dataset in the same order.

This is unfortunately not the case at the moment, and there is no agreed-upon
way to fix this in Spark itself (see PR #4204, which attempted to fix this).
Users should be aware of this.
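
As an illustration only (this is not part of the patch), one way a user might
cope with the missing guarantee is to store an explicit position next to each
record and sort on it after loading. The sketch below is plain Python with no
Spark dependency. The helper names (`save_partitions`, `load_records`,
`restore_order`, `round_trip`) are hypothetical, records are assumed to be
single-line strings, and the shuffled file listing merely stands in for a
reader that makes no ordering promise. In Spark, the analogous steps would
presumably be `zipWithIndex` before saving and a sort on the index after
loading.

```python
import os
import random
import tempfile


def save_partitions(partitions, out_dir):
    """Write each partition to its own part file, tagging every record
    with its global position in the original collection."""
    index = 0
    for part_no, records in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{part_no:05d}")
        with open(path, "w", encoding="utf-8") as f:
            for record in records:
                # Assumption: records contain no newline characters.
                f.write(f"{index}\t{record}\n")
                index += 1


def load_records(in_dir):
    """Read all part files back as (index, record) pairs.
    The file order is shuffled on purpose, so nothing relies on it."""
    names = [n for n in os.listdir(in_dir) if n.startswith("part-")]
    random.shuffle(names)  # stand-in for a reader with no ordering guarantee
    pairs = []
    for name in names:
        with open(os.path.join(in_dir, name), encoding="utf-8") as f:
            for line in f:
                idx, record = line.rstrip("\n").split("\t", 1)
                pairs.append((int(idx), record))
    return pairs


def restore_order(pairs):
    """Recover the original order from the stored indices."""
    return [record for _, record in sorted(pairs, key=lambda p: p[0])]


def round_trip(partitions):
    """Save, reload in arbitrary file order, and restore the original order."""
    with tempfile.TemporaryDirectory() as tmp:
        save_partitions(partitions, tmp)
        return restore_order(load_records(tmp))
```

The cost of this approach is an extra field per record and a sort on read, in
exchange for not depending on how the part files happen to be listed.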

### Does this PR introduce _any_ user-facing change?

Yes, two new sentences in the documentation.

### How was this patch tested?

By checking that the documentation looks good.

