[
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704923#comment-14704923
]
Silas Davis edited comment on SPARK-3533 at 8/20/15 2:05 PM:
-------------------------------------------------------------
[~nchammas] I don't have implementations for Python or Java, other than noting
that they could use the same OutputFormat to write multiple outputs through the
current Spark API; I'd be willing to try to put something together, though. At
this stage I think it might be a bit premature for a PR, as what I wrote
deliberately works without changing existing Spark code, but I have a feeling
that a more elegant solution might be reached by transplanting some code from
MultipleOutputsFormat into PairRDDFunctions. Has anyone had a chance to grok
what I'm doing in the gist? Would it be a good idea to parameterise
saveAsNewAPIHadoopDataset so that it can use MultipleOutputs directly? I might
see if I can work out something sensible along these lines so we can compare
the approaches.
Another thing: I've only just noticed that this ticket refers to
saveAsTextFileByKey. In my gist you can see I also have variants for
saveAsMultipleAvroFiles and saveAsMultipleParquetFiles using the same approach.
We don't have to include these specific helpers, but I think we should
generalise this to multiple outputs for any OutputFormat. Can we expand this
ticket, or should I open a new one?
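To illustrate the generalisation being proposed here, the sketch below is plain Python (no Spark, no Hadoop) showing the shape of a by-key save whose per-record serializer is pluggable, playing the role an OutputFormat plays in Hadoop. The helper name `save_by_key` and its signature are hypothetical, not part of any Spark API:

```python
import os
from collections import defaultdict

def save_by_key(records, base_path, write_record):
    """Write (key, value) pairs into one directory per key.

    `write_record` is the pluggable per-value serializer -- the analogue
    of parameterising by OutputFormat. (Hypothetical helper; illustration
    only, not Spark code.)
    """
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)
    for key, values in by_key.items():
        key_dir = os.path.join(base_path, str(key))
        os.makedirs(key_dir, exist_ok=True)
        # A single part file per key; a real implementation would write
        # one part file per partition instead.
        with open(os.path.join(key_dir, "part-00000"), "w") as f:
            for value in values:
                f.write(write_record(value))
```

Swapping `write_record` for an Avro or Parquet writer would give the saveAsMultipleAvroFiles / saveAsMultipleParquetFiles variants without changing the by-key plumbing, which is the point of generalising past text output.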
[~saurfang] With reference to what I've said above, I think it would be better
to provide a solution for any type of multiple outputs, not just text, as they
lend themselves to a unified approach and we might as well kill n birds with
one stone. Also, Hadoop 2 has no equivalent of MultipleTextOutputFormat,
whereas Hadoop 1 does have a MultipleOutputs class which seems largely similar,
so I think we can use an approach involving MultipleOutputs for both Hadoop 1
and 2. My personal opinion would therefore be not to put together a PR based on
MultipleTextOutputFormat. I would welcome your assistance on a more general PR,
and comments on the approach.
> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 1.1.0
> Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so
> that I have one output directory per distinct key. Each output directory
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward.
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs
> that makes it easy to save RDDs out to multiple locations at once.
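The per-key, per-partition layout the ticket asks for can be sketched outside Spark; the function below simulates what such a method might do for a single RDD partition, appending each value under /prefix/<key>/part-<index>. The name `save_partition_by_key` is hypothetical and nothing here is actual Spark API:

```python
import os

def save_partition_by_key(partition_index, pairs, prefix):
    """Write one partition's (key, value) pairs to per-key directories,
    producing prefix/<key>/part-<partition_index>. Sketch only; no Spark.
    """
    handles = {}
    try:
        for key, value in pairs:
            if key not in handles:
                key_dir = os.path.join(prefix, str(key))
                os.makedirs(key_dir, exist_ok=True)
                part = "part-%05d" % partition_index
                handles[key] = open(os.path.join(key_dir, part), "w")
            handles[key].write(value + "\n")
    finally:
        for handle in handles.values():
            handle.close()
```

Running this once per partition (e.g. partition 0 and partition 1) yields exactly the layout shown above: one directory per distinct key, each holding one part file per partition that contained that key.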
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]