Hello, we are currently trying to move one of our large-scale batch jobs (~100TB inputs) from our Euphoria <http://github.com/seznam/euphoria> based SparkRunner to Beam's Spark runner, and we came across the following issue.
Because we rely on the Hadoop ecosystem, we need to group outputs by TaskAttemptID in order to use OutputFormats based on FileOutputFormat. We do this with *GroupByKey*, but we ran into the known problem that all values for any single key need to fit in memory at once. I did some quick research and I think the following needs to be addressed:

a) We cannot use Spark's *groupByKey*, because it requires all values for a single key to fit in memory (it is implemented as a "list combiner").
b) *ReduceFnRunner* iterates over the values multiple times in order to also group by window.

In the Euphoria-based runner, we solved this for *non-merging* windowing by using Spark's *repartitionAndSortWithinPartitions*: we sorted the output by key and window, so it could be processed sequentially without materializing a whole group in memory.

Did anyone run into the same issue? Is there currently any workaround for this? How should we approach this?

Thanks,
David
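For illustration, here is a minimal sketch of the secondary-sort idea in plain Java, with no Spark dependency (the Element record, key names, and window indices are all hypothetical): once a partition is sorted by a composite (key, window) comparator, as *repartitionAndSortWithinPartitions* would deliver it, each group can be consumed in a single sequential pass by detecting boundaries where the key or window changes, so no group ever has to be held in memory at once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Hypothetical element: grouping key, window index, and a value.
    record Element(String key, int window, String value) {}

    public static void main(String[] args) {
        List<Element> partition = new ArrayList<>(List.of(
                new Element("task-1", 1, "b"),
                new Element("task-0", 0, "a"),
                new Element("task-1", 0, "c"),
                new Element("task-0", 0, "d")));

        // Simulates the per-partition ordering that
        // repartitionAndSortWithinPartitions would produce: (key, window).
        partition.sort(Comparator.comparing(Element::key)
                .thenComparingInt(Element::window));

        // One sequential pass: a change in (key, window) marks a group
        // boundary, so groups are emitted without being materialized.
        String currentKey = null;
        int currentWindow = -1;
        for (Element e : partition) {
            if (!e.key().equals(currentKey) || e.window() != currentWindow) {
                currentKey = e.key();
                currentWindow = e.window();
                System.out.println("group start: key=" + currentKey
                        + " window=" + currentWindow);
            }
            System.out.println("  value=" + e.value());
        }
    }
}
```

The same pattern generalizes to non-merging windowing only, since merging windows (e.g. sessions) cannot be encoded into a sort key known up front.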
