If you write to Parquet you can use the partitionBy option, which writes a 
separate subdirectory for each value of the column (assuming you have a column 
with the month).
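
A minimal sketch of that (the DataFrame name, column name, and output path are 
just illustrative; it assumes an existing SparkSession and a DataFrame df that 
already has a "month" column):

    // Each distinct value of "month" goes to its own subdirectory,
    // e.g. .../month=2016-11/ (names and path are hypothetical).
    df.write
      .partitionBy("month")
      .parquet("/path/to/output")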

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Tuesday, December 06, 2016 3:33 AM
To: Everett Anderson
Cc: user
Subject: Re: Writing DataFrame filter results to separate files

1. In my case, I'd need to first explode my data by ~12x to assign each record 
to multiple 12-month rolling output windows. I'm not sure Spark SQL would be 
able to optimize this away, combining it with the output writing to do it 
incrementally.

You are right, but I wouldn't worry about the RAM use.  If implemented properly 
(or if you just use the built-in window function: 
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)
), it should all be pipelined.
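
Roughly, a sketch of what that could look like (column names and durations are 
illustrative, not from the original thread; window() takes fixed-length 
durations, so day-based values stand in for "12 months sliding monthly"):

    import org.apache.spark.sql.functions.{col, window}

    // With a slide shorter than the window duration, each row is assigned
    // to every window it overlaps (~12 output rows per input row here),
    // but Spark expands this as rows flow through rather than holding the
    // full 12x copy in memory.
    // "eventTime" is a hypothetical timestamp column.
    val assigned = df.select(
      col("*"),
      window(col("eventTime"), "360 days", "30 days").as("win"))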

2. Wouldn't each partition -- window in my case -- be shuffled to a single 
machine and then written together as one output shard? For a large amount of 
data per window, that seems less than ideal.

Oh sorry, I thought you wanted one file per value.  If you drop the repartition 
then it won't shuffle, but will just write in parallel on each machine.
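
For example, continuing the hypothetical sketch above, combining that with 
partitionBy splits the output by window without a shuffle; each task writes 
files for the window values it already holds, which means more (smaller) files 
per window but fully parallel writes:

    // No repartition, so no shuffle: tasks write in parallel, one
    // subdirectory per window-start value. The struct produced by
    // window() can't itself be a partition column, so pull out its
    // start field and cast it to a date first.
    assigned
      .withColumn("window_start", col("win.start").cast("date"))
      .drop("win")
      .write
      .partitionBy("window_start")
      .parquet("/path/to/rolling-output")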
