[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-3533:
------------------------------------
    Description:

Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys.

For example, say I have an RDD like this:

{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear whether it's even possible at all in PySpark.

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once.

---

Update: March 2016

There are two workarounds to this problem:

1. See [this answer on Stack Overflow|http://stackoverflow.com/a/26051042/877069], which implements {{MultipleTextOutputFormat}}. (Scala-only)
2. See [this comment by Davies Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which uses DataFrames:

{code}
val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text")
df.write.partitionBy("key").text(path)
{code}
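For illustration, the layout being requested can be sketched in plain Python, with no Spark APIs involved: {{save_as_text_file_by_key}}, the hand-built partition list, and the temp-directory prefix below are all hypothetical stand-ins, mimicking what one task per RDD partition would write.

```python
import os
import tempfile
from collections import defaultdict

def save_as_text_file_by_key(partitions, prefix):
    """Write each record to <prefix>/<key>/part-<n>, where <n> is the
    index of the partition the record came from."""
    for part_index, partition in enumerate(partitions):
        # Each partition contributes its own part- file to every key
        # directory it has records for.
        by_key = defaultdict(list)
        for key, value in partition:
            by_key[key].append(value)
        for key, values in by_key.items():
            out_dir = os.path.join(prefix, key)
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, "part-%d" % part_index), "w") as f:
                f.write("\n".join(values) + "\n")

# Mirror the example RDD: names keyed by first letter, split by hand
# into two "partitions".
names = ["Nick", "Nancy", "Bob", "Ben", "Frankie"]
pairs = [(x[0], x) for x in names]
prefix = os.path.join(tempfile.mkdtemp(), "prefix")
save_as_text_file_by_key([pairs[:3], pairs[3:]], prefix)
print(sorted(os.listdir(prefix)))  # ['B', 'F', 'N']
```

The DataFrame workaround quoted above should translate to PySpark as roughly {{rdd.toDF(["key", "text"]).write.partitionBy("key").text(path)}} in Spark versions that support the text data source, though that goes through DataFrames rather than adding the RDD method this issue asks for.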
> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org