If you repartition($"column") and then do .write.partitionBy("column"), you
should end up with a single file for each value of the partition column,
since the repartition puts all rows with a given value into the same
partition.
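
For example, here's a minimal sketch of that approach; the window list, the
window IDs, and the "date" column are assumed names for illustration, not
from this thread:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: tag each row with every window it falls into, then write all
// windows in a single job. Overlapping windows duplicate their rows, which
// is the data-explosion trade-off raised below.
def writeAllWindows(allData: DataFrame,
                    windows: Seq[(String, String)],
                    outPath: String): Unit = {
  val tagged = windows.zipWithIndex.map { case ((start, end), id) =>
    allData
      .filter(col("date").between(start, end))  // rows inside this window
      .withColumn("window", lit(id))            // tag with the window ID
  }.reduce(_ union _)

  tagged
    .repartition(col("window"))  // all rows of a window land in one partition
    .write
    .partitionBy("window")       // one output directory per window value
    .parquet(outPath)
}

The union of filters duplicates rows shared by overlapping windows, but it
stays lazy, so nothing beyond a normal shuffle is materialized in RAM.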

On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson <ever...@nuna.com.invalid>
wrote:

> Hi,
>
> I have a DataFrame of records with dates, and I'd like to write all
> 12-month (with overlap) windows to separate outputs.
>
> Currently, I have a loop equivalent to:
>
> for ((windowStart, windowEnd) <- windows) {
>     // each iteration runs as its own sequential Spark write job
>     val windowData = allData.filter(
>         getFilterCriteria(windowStart, windowEnd))
>     windowData.write.format(...).save(...)
> }
>
> This works fine, but has the drawback that since Spark doesn't parallelize
> the writes, there is a fairly high cost that scales with the number of
> windows.
>
> Is there a way around this?
>
> In MapReduce, I'd probably multiply the data in a Mapper with a window ID
> and then maybe use something like MultipleOutputs
> <https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html>.
> But I'm a bit worried about trying to do this in Spark because of the data
> explosion and RAM use. What's the best approach?
>
> Thanks!
>
> - Everett
>
>
