Hello,

we are currently trying to move one of our large-scale batch jobs (~100 TB
of input) from our Euphoria <http://github.com/seznam/euphoria> based
SparkRunner to Beam's Spark runner, and we ran into the following issue.

Because we rely on the Hadoop ecosystem, we need to group outputs by
TaskAttemptID in order to use OutputFormats based on FileOutputFormat.
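To make the grouping concrete, here is a toy sketch of the idea in plain
Java (the `taskAttemptIdFor` helper and the integer records are hypothetical
stand-ins; in the real job the key would be a Hadoop TaskAttemptID and the
grouping would happen inside the pipeline):

```java
import java.util.*;
import java.util.stream.*;

public class GroupByAttempt {
    // Hypothetical: derive a task-attempt key for a record. In the real job
    // this would be the string form of a Hadoop TaskAttemptID.
    static String taskAttemptIdFor(int record, int numTasks) {
        return "attempt_0000_m_" + (record % numTasks);
    }

    // Group records by attempt id, so each group can be handed to a single
    // FileOutputFormat-backed writer.
    static Map<String, List<Integer>> groupByAttempt(List<Integer> records,
                                                     int numTasks) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> taskAttemptIdFor(r, numTasks)));
    }

    public static void main(String[] args) {
        // Six records spread over two task attempts.
        System.out.println(groupByAttempt(List.of(0, 1, 2, 3, 4, 5), 2));
    }
}
```

The problem described below is exactly that the per-attempt value lists in
such a grouping can be far too large to materialize in memory.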

We do this using *GroupByKey*, but we hit the known problem that all
values for any single key need to fit in memory at once.

I did some quick research, and I think the following needs to be addressed:
a) We cannot use Spark's *groupByKey*, because it requires all values for a
single key to fit in memory (it is implemented as a "list combiner")
b) *ReduceFnRunner* iterates over the values multiple times in order to
also group by window

In the Euphoria-based runner, we solved this for *non-merging* windowing by
using Spark's *repartitionAndSortWithinPartitions*, sorting the output by
key and window so it could be processed sequentially.
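The trick can be sketched outside Spark as well: sort records by a composite
(key, window) key, then stream over them and emit a group whenever the
composite key changes, which needs only O(1) buffering per group boundary.
A minimal plain-Java illustration (the `Rec` type and the in-memory sort are
simplified stand-ins for *repartitionAndSortWithinPartitions* with a custom
partitioner and comparator):

```java
import java.util.*;

public class SecondarySort {
    // Simplified stand-in for a (key, window, value) record.
    record Rec(String key, long window, String value) {}

    // Sort by (key, window) so each group is contiguous, then split groups
    // in a single sequential pass instead of buffering all values per key.
    static List<List<Rec>> groupSequentially(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing(Rec::key)
                              .thenComparingLong(Rec::window));

        List<List<Rec>> groups = new ArrayList<>();
        List<Rec> current = new ArrayList<>();
        for (Rec r : sorted) {
            boolean boundary = !current.isEmpty()
                    && !(current.get(0).key().equals(r.key())
                         && current.get(0).window() == r.window());
            if (boundary) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(r);
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> in = List.of(
                new Rec("a", 1, "x"), new Rec("b", 1, "y"),
                new Rec("a", 1, "z"), new Rec("a", 2, "w"));
        // Three contiguous (key, window) groups: (a,1), (a,2), (b,1).
        System.out.println(groupSequentially(in).size()); // prints 3
    }
}
```

In the real runner, Spark performs the sort during the shuffle, so the
sequential pass happens per partition without an in-memory sort.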

Has anyone run into the same issue? Is there currently any workaround for
this? How should we approach it?

Thanks,
David
