Robert Bradshaw updated BEAM-5392:
----------------------------------
    Summary: GroupByKey on Spark: All values for a single key need to fit in-memory at once  (was: GroupByKey: All values for a single key need to fit in-memory at once)

> GroupByKey on Spark: All values for a single key need to fit in-memory at once
> -------------------------------------------------------------------------------
>
>                 Key: BEAM-5392
>                 URL: https://issues.apache.org/jira/browse/BEAM-5392
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>    Affects Versions: 2.6.0
>            Reporter: David Moravek
>            Assignee: David Moravek
>            Priority: Major
>              Labels: performance
>
> Currently, when using GroupByKey, all values for a single key need to fit in memory at once.
>
> Two issues need to be addressed:
> a) We cannot use Spark's _groupByKey_, because it requires all values for a single key to fit in memory; it is implemented as a "list combiner" (see the first sketch below).
> b) _ReduceFnRunner_ iterates over the values multiple times in order to also group them by window.
>
> Solution:
>
> The Dataflow Worker code contains optimized versions of _ReduceFnRunner_ that can take advantage of the elements for a single key arriving sorted by timestamp.
>
> We can use Spark's {{repartitionAndSortWithinPartitions}} to meet this constraint (second sketch below).
>
> For non-merging windows, we can additionally put the window itself into the key, resulting in smaller groupings (third sketch below).
>
> This approach has already been tested at ~100TB input scale on Spark 2.3.x (a custom Spark runner).
>
> I'll submit a patch once the Dataflow Worker code donation is complete.
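For reference on point (a): at the RDD level, Spark's _groupByKey_ hands the consumer a fully materialized in-memory buffer (a CompactBuffer) holding every value for a key. A minimal illustration, with a hypothetical wrapper class and simplified element types chosen just for the sketch:

{code:java}
import org.apache.spark.api.java.JavaPairRDD;

class GroupByKeyMaterializesSketch {
  // groupByKey gathers all values for a key into one in-memory buffer before
  // the consumer sees it, so a single hot key with more values than fit on an
  // executor's heap fails with OOM. No downstream streaming can help, because
  // the Iterable is already fully materialized by the time it is handed over.
  static JavaPairRDD<String, Iterable<Long>> group(JavaPairRDD<String, Long> pairs) {
    return pairs.groupByKey();
  }
}
{code}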
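A minimal sketch of the {{repartitionAndSortWithinPartitions}} idea, not the actual runner code (all class and method names here are hypothetical, the user key is simplified to a String, and the payload to encoded bytes): partition by the user key alone, but sort by (key, timestamp), so that after the shuffle a single linear pass over a partition sees each key's values contiguously and in timestamp order.

{code:java}
import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

class SortedGroupingSketch {

  // Route each composite (key, timestamp) pair by the user key alone, so that
  // all values for one key land in the same partition.
  static class KeyOnlyPartitioner extends Partitioner {
    private final int numPartitions;

    KeyOnlyPartitioner(int numPartitions) {
      this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
      return numPartitions;
    }

    @Override
    public int getPartition(Object compositeKey) {
      @SuppressWarnings("unchecked")
      Tuple2<String, Long> keyAndTimestamp = (Tuple2<String, Long>) compositeKey;
      int part = keyAndTimestamp._1().hashCode() % numPartitions;
      return part < 0 ? part + numPartitions : part;
    }
  }

  // Sort by user key first, then by timestamp: after the shuffle, one linear
  // pass over a partition sees each key's values contiguously, timestamp-sorted.
  static class KeyThenTimestampComparator
      implements Comparator<Tuple2<String, Long>>, Serializable {
    @Override
    public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
      int byKey = a._1().compareTo(b._1());
      return byKey != 0 ? byKey : Long.compare(a._2(), b._2());
    }
  }

  // Input is keyed by (userKey, timestamp); the payload stays as encoded
  // bytes to sidestep coder details in this sketch.
  static JavaPairRDD<Tuple2<String, Long>, byte[]> sortForStreaming(
      JavaPairRDD<Tuple2<String, Long>, byte[]> keyedByKeyAndTimestamp,
      int numPartitions) {
    return keyedByKeyAndTimestamp.repartitionAndSortWithinPartitions(
        new KeyOnlyPartitioner(numPartitions), new KeyThenTimestampComparator());
  }
}
{code}

A downstream {{mapPartitions}} can then walk the sorted iterator, detect key boundaries, and feed one element at a time into the timestamp-sorted _ReduceFnRunner_, without ever buffering a whole group.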
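And a sketch of folding the window into the key for non-merging windows (again with hypothetical names and simplified types; assumes the {{org.apache.beam.sdk.util.WindowedValue}} as of 2.6.0):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.values.KV;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

class WindowInKeySketch {

  // For a non-merging WindowFn the windows assigned to an element are final,
  // so the window can be folded into the shuffle key: each (key, window) pair
  // then forms its own, smaller group.
  static JavaPairRDD<Tuple2<String, BoundedWindow>, Long> keyByKeyAndWindow(
      JavaRDD<WindowedValue<KV<String, Long>>> input) {
    return input.flatMapToPair(
        wv -> {
          List<Tuple2<Tuple2<String, BoundedWindow>, Long>> out = new ArrayList<>();
          for (BoundedWindow window : wv.getWindows()) {
            // An element in N windows is emitted N times, once under each
            // (key, window) composite key.
            out.add(
                new Tuple2<>(
                    new Tuple2<>(wv.getValue().getKey(), window),
                    wv.getValue().getValue()));
          }
          return out.iterator();
        });
  }
}
{code}

Since a non-merging WindowFn never combines groups across windows, each (key, window) group can be processed independently, bounding group sizes by the window's contents rather than by a key's total volume.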