If you write to Parquet, you can use the partitionBy option, which writes the data under a separate directory for each value of the column (assuming you have a column containing the month).
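A minimal sketch of that approach; the paths and the "month" column name are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-example").getOrCreate()

// Assumed input with a "month" column, e.g. "2016-01" (illustrative).
val df = spark.read.parquet("/path/to/input")

// partitionBy writes one subdirectory per distinct value of "month":
//   /path/to/output/month=2016-01/part-*.parquet
//   /path/to/output/month=2016-02/part-*.parquet
df.write
  .partitionBy("month")
  .parquet("/path/to/output")

Readers that understand partition discovery (including Spark itself) can then prune whole directories when filtering on the month column.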
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Tuesday, December 06, 2016 3:33 AM
To: Everett Anderson
Cc: user
Subject: Re: Writing DataFrame filter results to separate files

> 1. In my case, I'd need to first explode my data by ~12x to assign each record to multiple 12-month rolling output windows. I'm not sure Spark SQL would be able to optimize this away, combining it with the output writing to do it incrementally.

You are right, but I wouldn't worry about the RAM use. If implemented properly (or if you just use the built-in window <https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)> function), it should all be pipelined.

> 2. Wouldn't each partition -- window in my case -- be shuffled to a single machine and then written together as one output shard? For a large amount of data per window, that seems less than ideal.

Oh, sorry, I thought you wanted one file per value. If you drop the repartition then it won't shuffle, but will just write in parallel on each machine.
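A rough sketch of the pipelined approach described above, assuming an events DataFrame with an "eventTime" timestamp column (all names and paths are illustrative). One caveat: window() only accepts fixed-length durations, so day-based intervals stand in for calendar months here:

import org.apache.spark.sql.functions.{col, date_format, window}

// Assumed input with an "eventTime" timestamp column (illustrative).
val events = spark.read.parquet("/path/to/events")

// window() assigns each row to every sliding window that contains it; with
// a 360-day window sliding every 30 days, each row lands in ~12 windows,
// and the expansion is pipelined into the write rather than materialized.
// (Window durations must be fixed-length, so days approximate months.)
val windowed = events
  .withColumn("window", window(col("eventTime"), "360 days", "30 days"))
  .withColumn("windowStart", date_format(col("window.start"), "yyyy-MM-dd"))

// No repartition: each task writes files for whatever windows appear in its
// own partition, in parallel. Adding .repartition(col("windowStart")) first
// would instead shuffle each window to a single task (one shard per window).
windowed.write
  .partitionBy("windowStart")
  .parquet("/path/to/windowed-output")

Dropping the repartition trades one-file-per-window for fully parallel writes, at the cost of more, smaller files within each window directory.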