If you repartition($"column") and then call .write.partitionBy("column"), you should end up with a single file for each value of the partition column.
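As a rough sketch of that approach (and of the window-ID idea from the original question): tag each record with the IDs of the windows it belongs to, explode, then repartition and partitionBy on that column so all windows are written in one parallel job. This assumes a DataFrame `allData` with a `date` column; `windowIdsFor` is a hypothetical helper you would implement for your window scheme, and the output path and format are illustrative.

```scala
import org.apache.spark.sql.functions._

// Hypothetical UDF: given a record's date, return the IDs of every
// overlapping 12-month window containing it. The body is a placeholder;
// the real logic depends on how your windows are defined.
val windowIdsFor = udf { date: java.sql.Date =>
  Seq("2016-01", "2016-02") // placeholder window IDs
}

// explode duplicates each row once per window it falls in -- this is the
// "multiply the data" step, but it happens lazily within one job rather
// than materializing everything in RAM at once.
val tagged = allData.withColumn("window_id", explode(windowIdsFor($"date")))

tagged
  .repartition($"window_id")      // one partition per window value
  .write
  .partitionBy("window_id")       // one output directory per window
  .format("parquet")
  .save("/output/windows")        // illustrative path
```

Each window then lands under its own `window_id=...` subdirectory, and the writes run in parallel instead of one sequential job per window.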
On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson <ever...@nuna.com.invalid> wrote:
> Hi,
>
> I have a DataFrame of records with dates, and I'd like to write all
> 12-month (with overlap) windows to separate outputs.
>
> Currently, I have a loop equivalent to:
>
> for ((windowStart, windowEnd) <- windows) {
>   val windowData = allData.filter(
>     getFilterCriteria(windowStart, windowEnd))
>   windowData.write.format(...).save(...)
> }
>
> This works fine, but has the drawback that since Spark doesn't parallelize
> the writes, there is a fairly high cost that scales with the number of windows.
>
> Is there a way around this?
>
> In MapReduce, I'd probably multiply the data in a Mapper with a window ID
> and then maybe use something like MultipleOutputs
> <https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html>.
> But I'm a bit worried about trying to do this in Spark because of the data
> explosion and RAM use. What's the best approach?
>
> Thanks!
>
> - Everett