[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

Ilya Ganelin (JIRA) Mon, 05 Jan 2015 13:19:28 -0800

    [ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265122#comment-14265122 ]


Ilya Ganelin commented on SPARK-3533:
-------------------------------------

Hi all - I have that solution (using MultipleTextOutputFormat) implemented, but 
sadly it doesn't work out of the box. The test fails with:

saveAsHadoopFileByKey should generate a text file per key *** FAILED ***
  java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.spark.rdd.PairRDDFunctions$RDDMultipleTextOutputFormat.<init>()
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)

Adding an init method to the definition does not help either, so I'm still 
digging into other options. My code is here: 
https://github.com/ilganeli/spark/tree/SPARK-3533

I'll keep looking for alternatives. 
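
One possible reading of the stack trace, not verified against the branch above: the `<init>()` in the exception suggests that ReflectionUtils.newInstance is looking for a zero-argument constructor, and a class nested inside PairRDDFunctions would normally take the enclosing instance as a hidden constructor argument, so no such constructor would exist. The sketch below is only a Python analogy of that situation, not the Spark or Hadoop code; `new_instance`, `Outer`, `NeedsOuter` and `TopLevelFormat` are hypothetical names made up for illustration.

```python
def new_instance(cls):
    """Mimic a reflective factory that can only call a zero-argument constructor."""
    return cls()


class Outer:
    class NeedsOuter:
        # Stand-in for a nested class whose constructor requires the enclosing instance.
        def __init__(self, outer):
            self.outer = outer


class TopLevelFormat:
    # Stand-in for a standalone class with a plain zero-argument constructor.
    def __init__(self):
        pass


try:
    new_instance(Outer.NeedsOuter)
    failed = False
except TypeError:
    # The factory passed no arguments, so construction is rejected.
    failed = True

ok = new_instance(TopLevelFormat)
```

If that reading holds, moving the output format class out of the enclosing class (so that it has a true zero-argument constructor) may be worth trying; this is an assumption, not something confirmed in this thread.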

> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.
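
To make the requested layout concrete, here is a minimal sketch in plain Python (standard library only, no Spark) that writes one directory per key and one part file per partition. It is not a proposed implementation of `saveAsTextFileByKey()`; the function name `save_as_text_file_by_key`, the hand-made list of partitions, and the `part-%05d` file naming are assumptions made for illustration.

```python
import os
import tempfile
from collections import defaultdict


def save_as_text_file_by_key(partitions, prefix):
    """Write each (key, value) pair to <prefix>/<key>/part-<partition index>."""
    written = set()
    for index, partition in enumerate(partitions):
        # Group this partition's values by key.
        by_key = defaultdict(list)
        for key, value in partition:
            by_key[key].append(value)
        # Each key gets its own directory; each partition contributes one file.
        for key, values in by_key.items():
            out_dir = os.path.join(prefix, str(key))
            os.makedirs(out_dir, exist_ok=True)
            path = os.path.join(out_dir, "part-%05d" % index)
            with open(path, "w") as handle:
                handle.write("\n".join(values) + "\n")
            written.add(path)
    return sorted(written)


names = ['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']
pairs = [(name[0], name) for name in names]   # same pairs as keyBy(lambda x: x[0])
partitions = [pairs[:3], pairs[3:]]           # two hand-made "partitions"
prefix = tempfile.mkdtemp()
paths = save_as_text_file_by_key(partitions, prefix)
```

With this split, the `B` directory receives a part file from both partitions, while `N` and `F` receive one each.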



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)