[
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705446#comment-14705446
]
Nicholas Chammas commented on SPARK-3533:
-----------------------------------------
{quote}
Nicholas Chammas, have you been able to take a look at the code?
{quote}
I'm unfortunately not in a good position to review the code and give the
appropriate feedback. Someone like [~davies] or [~srowen] may be able to do
that, but I can't speak for their availability.
{quote}
I'm not sure if you're suggesting it would be better to make a pull request
now, or whether the gist is sufficient. I will open a pull request if you
prefer. Is there anything else I should be doing to get committer buy-in?
{quote}
As a fellow contributor, I'm just advising that committer buy-in is essential
to getting a feature like this landed. To get that, you may need to risk
offering up a full solution knowing that it may be rejected or require many
changes before acceptance.

An alternative would be to get some pre-approval for the idea and guidance from
a committer (perhaps via the dev list) before crafting a full solution.
However, if the feature is not already a priority for some committer, this is
unlikely to happen.

I'm not sure what the right way to go is, but those are your options,
realistically.
> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 1.1.0
> Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben',
> ...                     'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so
> that I have one output directory per distinct key. Each output directory
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward.
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs
> that makes it easy to save RDDs out to multiple locations at once.
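
For illustration only, the output layout requested in the issue description can be sketched in plain Python, without Spark. This is not the proposed implementation and not an existing Spark API: `save_text_by_key` is a hypothetical name, a list of lists stands in for the RDD's partitions, and the `part-%05d` file naming is an assumption modelled on the `part-` files mentioned above.

```python
import os
import tempfile


def save_text_by_key(partitions, prefix):
    """Write (key, value) pairs to one directory per distinct key.

    `partitions` is a list of lists of (key, value) pairs, standing in
    for the partitions of an RDD (an assumption for this sketch). Each
    partition writes at most one part file into each key's directory.
    """
    for index, partition in enumerate(partitions):
        handles = {}  # key -> open part file for this partition
        try:
            for key, value in partition:
                if key not in handles:
                    key_dir = os.path.join(prefix, str(key))
                    os.makedirs(key_dir, exist_ok=True)
                    part_path = os.path.join(key_dir, "part-%05d" % index)
                    handles[key] = open(part_path, "w")
                handles[key].write("%s\n" % value)
        finally:
            # Close every part file opened for this partition.
            for handle in handles.values():
                handle.close()


names = ['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']
pairs = [(name[0], name) for name in names]  # same effect as keyBy(lambda x: x[0])
partitions = [pairs[:3], pairs[3:]]          # pretend the RDD has two partitions
prefix = tempfile.mkdtemp()                  # stands in for /path/prefix
save_text_by_key(partitions, prefix)
print(sorted(os.listdir(prefix)))
```

With this input, only key B appears in both pretend partitions, so its directory gets two part files while N and F get one each. A real implementation would have to do the equivalent inside Spark's output machinery rather than on a local filesystem.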