GitHub user saurfang opened a pull request:
https://github.com/apache/spark/pull/8375
[SPARK-3533][Core] Add saveAsTextFileByKey() method to RDDs
This adds the ability to save an `RDD[(K, V)]` to multiple text files
split by key. It covers the Scala, Java, and Python APIs.
This is based on the Stack Overflow answer linked in the original JIRA and
on https://github.com/apache/spark/pull/4895.
Furthermore, I have fixed an issue where part filenames were not included
in the paths, causing executors to overwrite each other's output. Tests
verify the written contents.
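To illustrate why the part filename matters, here is a minimal plain-Python sketch with no Spark dependency. It is not the code in this PR: the helpers `save_partition_by_key` and `read_key` and the `<base_dir>/<key>/part-NNNNN` layout are assumptions made for illustration, and keys are assumed to be safe to use as directory names.

```python
import os
import tempfile
from collections import defaultdict


def save_partition_by_key(records, base_dir, partition_id):
    """Write one partition's (key, value) pairs to <base_dir>/<key>/part-<id>.

    Hypothetical helper: it stands in for what a single executor task would do.
    """
    # Group this partition's values by key so each key gets one file.
    by_key = defaultdict(list)
    for key, value in records:
        by_key[str(key)].append(str(value))

    written = []
    for key, values in by_key.items():
        key_dir = os.path.join(base_dir, key)
        os.makedirs(key_dir, exist_ok=True)
        # The partition id is part of the file name, so two partitions that
        # hold the same key write to different files instead of clobbering
        # one shared file.
        path = os.path.join(key_dir, "part-%05d" % partition_id)
        with open(path, "w") as f:
            for value in values:
                f.write(value + "\n")
        written.append(path)
    return sorted(written)


def read_key(base_dir, key):
    """Read back every line stored under one key, across all part files."""
    key_dir = os.path.join(base_dir, str(key))
    lines = []
    for name in sorted(os.listdir(key_dir)):
        with open(os.path.join(key_dir, name)) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    # Two simulated partitions that both contain the key "a".
    save_partition_by_key([("a", "1"), ("b", "2")], base, 0)
    save_partition_by_key([("a", "3")], base, 1)
    print(read_key(base, "a"))
```

If the file name were the key alone, the second partition's write for key "a" would replace the first one's, which is the kind of collision the fix above avoids.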
I'm very intrigued by @silasdavis's approach of using `MultipleOutputs`
to provide a more generic interface for writing arbitrary outputs by key:
https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
However, I have been unsuccessful in porting it to the Hadoop 1 mapred
API.
Therefore, I'm putting forth this PR, which achieves the simpler but highly
requested goal of writing text files only.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/saurfang/spark multitextouput
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8375.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8375
----
commit 4df1e0fe5804f27c842d7dd7192aa304c3b4a96a
Author: Forest Fang <[email protected]>
Date: 2015-08-22T19:17:07Z
Add saveAsTextFileByKey for PairRDD and JavaPairRDD
commit 43ad218a9745fbcfe7bede5686f45817e1f192a0
Author: Forest Fang <[email protected]>
Date: 2015-08-22T20:23:29Z
Python implementation
----