@Reynold Xin: not really. It only works for Parquet (see partitionBy: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter), and it requires you to have a DataFrame in the first place. For my use case the Spark SQL interface to Avro records is more of a hindrance than a help, since I want to use generated Java classes rather than treat Avro records as generic tables (via Rows). And even if I do have a DataFrame, as soon as I map or mapPartitions I lose that interface and have to create a new DataFrame from an RDD[Row], which is neither convenient nor efficient.
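
To make that round trip concrete, here's roughly what I mean (illustrative only: the column names, paths and transformation are made up, and I'm assuming a SQLContext in scope as sqlContext):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val df = sqlContext.read.parquet("/someBasePath/input") // hypothetical input

    // Any map/mapPartitions drops me down to RDD[Row]...
    val transformed: RDD[Row] = df.rdd.mapPartitions(_.map { row =>
      Row(row.getString(0), row.getString(1), row.getString(2).toLowerCase)
    })

    // ...so to get partitionBy back I have to re-state the schema and
    // build a whole new DataFrame from the RDD.
    val schema = StructType(Seq(
      StructField("language", StringType),
      StructField("date", StringType),
      StructField("userId", StringType)))

    sqlContext.createDataFrame(transformed, schema)
      .write
      .partitionBy("language", "date")
      .parquet("/someBasePath/output")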
Has anyone been able to take a look at my gist: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462? The first 100 lines provide a base class for multiple-outputs OutputFormats; then see line 269 for an example of how to use such an OutputFormat: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462#file-multipleoutputs-scala-L269

@Alex Angelini, the code there would support your use case without modifying Spark (it uses saveAsNewAPIHadoopFile and a multiple-outputs wrapper format). @Nicholas Chammas, I'll post a link to my gist on that ticket. For the curious, I've also sketched the hadoop1 MultipleTextOutputFormat route at the bottom of this mail, below the quoted thread.

On Fri, 14 Aug 2015 at 21:10 Reynold Xin <r...@databricks.com> wrote:

> This is already supported with the new partitioned data sources in
> DataFrame/SQL, right?
>
> On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini <alex.angel...@shopify.com>
> wrote:
>
>> Speaking for Shopify's deployment, this would be a really nice feature
>> to have.
>>
>> We would like to write data to folders with the structure
>> `<year>/<month>/<day>`, but have had to hold off on that because of the
>> lack of support for MultipleOutputs.
>>
>> On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis <si...@silasdavis.net>
>> wrote:
>>
>>> Would it be right to assume that the silence on this topic implies
>>> others don't really have this issue/desire?
>>>
>>> On Sat, 18 Jul 2015 at 17:24 Silas Davis <si...@silasdavis.net> wrote:
>>>
>>>> *tl;dr: hadoop and cascading provide ways of writing tuples to
>>>> multiple output files based on key, but the plain RDD interface
>>>> doesn't seem to, and it should.*
>>>>
>>>> I have been looking into ways to write to multiple outputs in Spark.
>>>> It seems like a feature that is somewhat missing from Spark.
>>>>
>>>> The idea is to partition output and write the elements of an RDD to
>>>> different locations based on the key. For example, in a pair RDD your
>>>> key may be (language, date, userId) and you would like to write
>>>> separate files to $someBasePath/$language/$date. Then there would be
>>>> a version of saveAsHadoopDataset able to write to multiple locations
>>>> based on the key, using the underlying OutputFormat. Perhaps it would
>>>> take a pair RDD with keys ($partitionKey, $realKey), so for example
>>>> ((language, date), userId).
>>>>
>>>> The prior art I have found on this is the following.
>>>>
>>>> Using Spark SQL:
>>>> The 'partitionBy' method of DataFrameWriter:
>>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>>> This only works for Parquet at the moment.
>>>>
>>>> Using Spark/Hadoop:
>>>> This pull request (using the hadoop1 API):
>>>> https://github.com/apache/spark/pull/4895/files
>>>> This uses MultipleTextOutputFormat (which in turn uses
>>>> MultipleOutputFormat), part of the old hadoop1 API. It only works for
>>>> text, but could be generalised to any underlying OutputFormat by
>>>> using MultipleOutputFormat directly (though only for hadoop1, which
>>>> doesn't support ParquetAvroOutputFormat, for example).
>>>>
>>>> This gist (using the hadoop2 API):
>>>> https://gist.github.com/mlehman/df9546f6be2e362bbad2
>>>> This uses MultipleOutputs (available for both the old and new hadoop
>>>> APIs) and extends saveAsNewHadoopDataset to support multiple outputs.
>>>> It should work for any underlying OutputFormat. Probably better
>>>> implemented by extending saveAs[NewAPI]HadoopDataset.
>>>>
>>>> In Cascading:
>>>> Cascading provides PartitionTap:
>>>> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
>>>> to do this.
>>>>
>>>> So my questions are: is there a reason why Spark doesn't provide
>>>> this? Does Spark provide similar functionality through some other
>>>> mechanism? How would it best be implemented?
>>>>
>>>> Since I started composing this message I've had a go at writing a
>>>> wrapper OutputFormat that writes multiple outputs using hadoop
>>>> MultipleOutputs but doesn't require modification of PairRDDFunctions.
>>>> The principle is similar, however. Again, it feels slightly hacky to
>>>> use dummy fields for the ReduceContextImpl, but some of this may be
>>>> part of the impedance mismatch between Spark and plain Hadoop... Here
>>>> is my attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>>>>
>>>> I'd like to see this functionality in Spark somehow, but I invite
>>>> suggestions of how best to achieve it.
>>>>
>>>> Thanks,
>>>> Silas
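
PS: to make the hadoop1 MultipleTextOutputFormat approach from my original
message concrete, here is a rough sketch (the (language, date) key layout,
class name and paths are just illustrative, not what the linked PR does
verbatim):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.rdd.RDD

    // Routes each record to $someBasePath/$language/$date/part-NNNNN.
    // Text-only and hadoop1-only, hence the desire for something general.
    class PartitionedTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Don't write the partition key into the output records themselves.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()

      // Derive the leaf path from the (language, date) partition key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
        val (language, date) = key.asInstanceOf[(String, String)]
        s"$language/$date/$name"
      }
    }

    // Usage: keys carry the partition columns, values the payload,
    // e.g. ((language, date), userId).
    def save(rdd: RDD[((String, String), String)]): Unit =
      rdd.saveAsHadoopFile("/someBasePath", classOf[Any], classOf[Any],
        classOf[PartitionedTextOutputFormat])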