[
https://issues.apache.org/jira/browse/BEAM-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Romanenko resolved BEAM-5392.
------------------------------------
Resolution: Fixed
Fix Version/s: 2.11.0
> GroupByKey on Spark: All values for a single key need to fit in-memory at once
> ------------------------------------------------------------------------------
>
> Key: BEAM-5392
> URL: https://issues.apache.org/jira/browse/BEAM-5392
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Affects Versions: 2.6.0
> Reporter: David Moravek
> Assignee: David Moravek
> Priority: Major
> Labels: performance, triaged
> Fix For: 2.11.0
>
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> Currently, when using GroupByKey, all values for a single key need to fit
> in memory at once.
>
> The following issues need to be addressed:
> a) We cannot use Spark's _groupByKey_, because it requires all values for a
> single key to fit in memory (it is implemented as a "list combiner"; see the
> sketch after this list)
> b) _ReduceFnRunner_ iterates over the values multiple times in order to also
> group them by window
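>
> For reference, a minimal sketch of the problematic pattern on the plain RDD
> API (a self-contained toy example; all names are illustrative):
> {code:java}
> import java.util.Arrays;
>
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import scala.Tuple2;
>
> public class GroupByKeyProblem {
>   public static void main(String[] args) {
>     JavaSparkContext sc = new JavaSparkContext("local[*]", "gbk-problem");
>     JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
>         new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));
>     // Plain Spark groupByKey: before the per-key Iterable is handed to user
>     // code, all values of that key are buffered into an in-memory collection
>     // (a CompactBuffer), which breaks down for hot keys.
>     JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();
>     grouped.collect().forEach(kv -> System.out.println(kv._1() + " -> " + kv._2()));
>     sc.stop();
>   }
> }
> {code}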
>
> Solution:
>
> The Dataflow Worker code contains optimized versions of _ReduceFnRunner_
> that can take advantage of having the elements for a single key sorted by
> timestamp.
>
> We can use Spark's {{repartitionAndSortWithinPartitions}} to meet this
> constraint.
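>
> A minimal sketch of how this could look on the RDD API, assuming a
> hypothetical composite key that carries the element's event-time timestamp
> (the real runner works on encoded keys; all names here are illustrative):
> {code:java}
> import java.io.Serializable;
>
> import org.apache.spark.Partitioner;
> import org.apache.spark.api.java.JavaPairRDD;
>
> // Composite key: element key plus event-time timestamp. Sorting by this key
> // within each partition yields, for every key, its values ordered by
> // timestamp -- the precondition the optimized ReduceFnRunner needs.
> class KeyAndTimestamp implements Serializable, Comparable<KeyAndTimestamp> {
>   final String key;
>   final long timestampMillis;
>
>   KeyAndTimestamp(String key, long timestampMillis) {
>     this.key = key;
>     this.timestampMillis = timestampMillis;
>   }
>
>   @Override
>   public int compareTo(KeyAndTimestamp other) {
>     int byKey = key.compareTo(other.key);
>     return byKey != 0 ? byKey : Long.compare(timestampMillis, other.timestampMillis);
>   }
> }
>
> // Partition on the element key only, so that all records of one key land in
> // the same partition, while the natural ordering above sorts them by
> // timestamp within that partition.
> class KeyOnlyPartitioner extends Partitioner {
>   private final int numPartitions;
>
>   KeyOnlyPartitioner(int numPartitions) {
>     this.numPartitions = numPartitions;
>   }
>
>   @Override
>   public int numPartitions() {
>     return numPartitions;
>   }
>
>   @Override
>   public int getPartition(Object compositeKey) {
>     String key = ((KeyAndTimestamp) compositeKey).key;
>     return Math.floorMod(key.hashCode(), numPartitions);
>   }
> }
>
> class SortedGrouping {
>   static <V> JavaPairRDD<KeyAndTimestamp, V> sortValuesByTimestamp(
>       JavaPairRDD<KeyAndTimestamp, V> input, int numPartitions) {
>     // Single shuffle: repartition by key and sort by (key, timestamp) within
>     // each partition, so downstream code can stream over a key's values in
>     // timestamp order without materializing them.
>     return input.repartitionAndSortWithinPartitions(
>         new KeyOnlyPartitioner(numPartitions));
>   }
> }
> {code}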
>
> For non-merging windows, we can put the window itself into the key,
> resulting in smaller groupings.
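>
> A hypothetical composite key for the non-merging case might look as follows
> (illustrative names; the real runner would use Beam's window types and
> coders):
> {code:java}
> import java.io.Serializable;
> import java.util.Objects;
>
> // Grouping by (key, window) instead of key alone means each group holds only
> // the values of a single window, so groups are smaller and the runner no
> // longer needs to split a key's values by window itself.
> class KeyAndWindow implements Serializable {
>   final String key;        // illustrative element key
>   final long windowStart;  // e.g. a fixed window, in epoch millis
>   final long windowEnd;
>
>   KeyAndWindow(String key, long windowStart, long windowEnd) {
>     this.key = key;
>     this.windowStart = windowStart;
>     this.windowEnd = windowEnd;
>   }
>
>   @Override
>   public boolean equals(Object o) {
>     if (!(o instanceof KeyAndWindow)) {
>       return false;
>     }
>     KeyAndWindow that = (KeyAndWindow) o;
>     return key.equals(that.key)
>         && windowStart == that.windowStart
>         && windowEnd == that.windowEnd;
>   }
>
>   @Override
>   public int hashCode() {
>     return Objects.hash(key, windowStart, windowEnd);
>   }
> }
> {code}
> This only works because non-merging windows (e.g. fixed or sliding) are
> known up front; merging windows such as sessions still have to be grouped by
> key alone.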
>
> This approach has already been tested at ~100 TB input scale on Spark 2.3.x
> (custom Spark runner).
>
> I'll submit a patch once the Dataflow Worker code donation is complete.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)