[
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704923#comment-14704923
]
Silas Davis edited comment on SPARK-3533 at 8/20/15 2:05 PM:
-------------------------------------------------------------
[~nchammas] I don't have implementations for Python or Java, other than noting
that they could use the same OutputFormat to write multiple outputs through the
current Spark API; I'd be willing to try to put something together, though. At
this stage I think it might be a bit premature for a PR, as what I wrote
deliberately works without changing existing Spark code, but I have a feeling
that a more elegant solution might be reached by transplanting some code from
MultipleOutputsFormat into PairRDDFunctions. Has anyone had a chance to grok
what I'm doing in the gist? Would it be a good idea to parameterise
saveAsNewAPIHadoopDataset so that it can use MultipleOutputs directly? I might
see if I can work out something sensible along these lines so we can compare
the approaches.
Another thing: I've only just noticed that this ticket refers to
saveAsTextFileByKey. In my gist you can see I also have variants for
saveAsMultipleAvroFiles and saveAsMultipleParquetFiles using the same approach.
We don't have to include these specific helpers, but I think we should
generalise this to multiple outputs for any OutputFormat. Can we expand this
ticket, or should I open a new one?
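To illustrate the generalisation being proposed here, the sketch below is plain Python (no Spark, no Hadoop) showing the shape of a by-key save whose per-record serializer is pluggable, playing the role an OutputFormat plays in Hadoop. The helper name `save_by_key` and its signature are hypothetical, not part of any Spark API:

```python
import os
from collections import defaultdict

def save_by_key(records, base_path, write_record):
    """Write (key, value) pairs into one directory per key.

    `write_record` is the pluggable per-value serializer -- the analogue
    of parameterising by OutputFormat. (Hypothetical helper; illustration
    only, not Spark code.)
    """
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)
    for key, values in by_key.items():
        key_dir = os.path.join(base_path, str(key))
        os.makedirs(key_dir, exist_ok=True)
        # A single part file per key; a real implementation would write
        # one part file per partition instead.
        with open(os.path.join(key_dir, "part-00000"), "w") as f:
            for value in values:
                f.write(write_record(value))
```

Swapping `write_record` for an Avro or Parquet writer would give the saveAsMultipleAvroFiles / saveAsMultipleParquetFiles variants without changing the by-key plumbing, which is the point of generalising past text output.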
[~saurfang] With reference to what I've said above, I think it would be better
to provide a solution for any type of multiple outputs, not just text, as they
lend themselves to a unified approach and we might as well kill n birds with
one stone. Also, Hadoop 2 has no equivalent of MultipleTextOutputFormat,
whereas Hadoop 1 does have a MultipleOutputs class which seems largely similar,
so I think we can use an approach involving MultipleOutputs for both Hadoop 1
and 2. My personal opinion would therefore be not to put together a PR based on
MultipleTextOutputFormat. I would welcome your assistance on a more general PR,
and comments on the approach.
> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 1.1.0
> Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so
> that I have one output directory per distinct key. Each output directory
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward.
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs
> that makes it easy to save RDDs out to multiple locations at once.
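The per-key, per-partition layout the ticket asks for can be sketched outside Spark; the function below simulates what such a method might do for a single RDD partition, appending each value under /prefix/<key>/part-<index>. The name `save_partition_by_key` is hypothetical and nothing here is actual Spark API:

```python
import os

def save_partition_by_key(partition_index, pairs, prefix):
    """Write one partition's (key, value) pairs to per-key directories,
    producing prefix/<key>/part-<partition_index>. Sketch only; no Spark.
    """
    handles = {}
    try:
        for key, value in pairs:
            if key not in handles:
                key_dir = os.path.join(prefix, str(key))
                os.makedirs(key_dir, exist_ok=True)
                part = "part-%05d" % partition_index
                handles[key] = open(os.path.join(key_dir, part), "w")
            handles[key].write(value + "\n")
    finally:
        for handle in handles.values():
            handle.close()
```

Running this once per partition (e.g. partition 0 and partition 1) yields exactly the layout shown above: one directory per distinct key, each holding one part file per partition that contained that key.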
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]