GitHub user saurfang opened a pull request:

    https://github.com/apache/spark/pull/8375

    [SPARK-3533][Core] Add saveAsTextFileByKey() method to RDDs

    This adds the functionality of saving an `RDD[(K, V)]` to multiple text files split by key. It covers the Scala/Java/Python APIs.
    
    This is based on the Stack Overflow answer linked in the original JIRA and https://github.com/apache/spark/pull/4895.
    Furthermore, I have fixed an issue where part filenames were not included in the output paths, causing executors to overwrite each other's files. Tests verify the written contents.
    
    I'm very intrigued by @silasdavis's approach of using `MultipleOutputs` to provide a more generic interface for writing arbitrary outputs by key: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
    However, I have been unsuccessful in porting it to work with the Hadoop 1 mapred API.
    
    Therefore I put forth this PR, which achieves the simpler but highly requested goal of writing text files only.
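    For readers unfamiliar with the feature, the intended semantics can be sketched in plain Python without Spark. The helper name `save_text_by_key` and its layout are hypothetical illustrations, not the PR's actual implementation; the point is that each key gets its own directory, and the part filename carries a writer id so concurrent writers do not clobber each other (the overwrite issue the description mentions fixing):

    ```python
    import os

    def save_text_by_key(records, output_dir, part_id=0):
        """Hypothetical sketch of saveAsTextFileByKey() semantics: append each
        (key, value) pair to a text file under a directory named after the key.
        The part-NNNNN filename includes the writer/partition id so that
        parallel writers produce distinct files instead of overwriting."""
        handles = {}
        try:
            for key, value in records:
                if key not in handles:
                    key_dir = os.path.join(output_dir, str(key))
                    os.makedirs(key_dir, exist_ok=True)
                    # e.g. <output_dir>/<key>/part-00000 for writer 0
                    path = os.path.join(key_dir, "part-%05d" % part_id)
                    handles[key] = open(path, "w")
                handles[key].write(str(value) + "\n")
        finally:
            for f in handles.values():
                f.close()
    ```

    In Spark itself each executor would run this per partition with its own `part_id`, which is why omitting the part filename from the path caused collisions.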

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saurfang/spark multitextouput

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8375.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8375
    
----
commit 4df1e0fe5804f27c842d7dd7192aa304c3b4a96a
Author: Forest Fang <[email protected]>
Date:   2015-08-22T19:17:07Z

    Add saveAsTextFileByKey for PairRDD and JavaPairRDD

commit 43ad218a9745fbcfe7bede5686f45817e1f192a0
Author: Forest Fang <[email protected]>
Date:   2015-08-22T20:23:29Z

    Python implementation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
