David Moravek created BEAM-5392:
-----------------------------------
Summary: GroupByKey: All values for a single key need to fit
in-memory at once
Key: BEAM-5392
URL: https://issues.apache.org/jira/browse/BEAM-5392
Project: Beam
Issue Type: Bug
Components: runner-spark
Affects Versions: 2.6.0
Reporter: David Moravek
Assignee: David Moravek
Currently, when using GroupByKey, all values for a single key need to fit
in memory at once.
The following issues need to be addressed:
a) We cannot use Spark's _groupByKey_, because it requires all values for a
single key to fit in memory (it is implemented as a "list combiner").
b) _ReduceFnRunner_ iterates over the values multiple times in order to also
group by window.
Solution:
The Dataflow Worker code contains optimized versions of ReduceFnRunner that
can take advantage of having the elements for a single key sorted by timestamp.
We can use Spark's {{repartitionAndSortWithinPartitions}} in order to meet
this constraint.
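To illustrate the idea, here is a minimal sketch in plain Java (no Spark dependency) of what a partition-by-key, sort-by-(key, timestamp) step buys us. The class {{SecondarySortSketch}}, the {{Element}} type and the helper names are hypothetical and are not part of Beam or Spark; the real runner would rely on Spark's shuffle to do the partitioning and sorting. The point is only that, once the input is sorted, a consumer can walk an iterator and keep just one element in memory at a time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: simulates "repartition by key, sort within partition"
// in memory so the access pattern is easy to see. Not Beam or Spark code.
final class SecondarySortSketch {

    /** Hypothetical element: key, event timestamp, payload. */
    static final class Element {
        final String key;
        final long timestamp;
        final String value;

        Element(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Partition on the key only, so all elements of a key share a partition. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Distributes elements to partitions, then sorts each by (key, timestamp). */
    static List<List<Element>> repartitionAndSort(List<Element> input, int numPartitions) {
        List<List<Element>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Element e : input) {
            partitions.get(partitionFor(e.key, numPartitions)).add(e);
        }
        Comparator<Element> byKeyThenTime =
            Comparator.comparing((Element e) -> e.key).thenComparingLong(e -> e.timestamp);
        for (List<Element> partition : partitions) {
            partition.sort(byKeyThenTime);
        }
        return partitions;
    }

    /**
     * Walks a sorted partition once and emits "key:firstTs..lastTs:count" per key.
     * Only the current element and a few counters are held at any time.
     */
    static List<String> summarizePerKey(Iterator<Element> sorted) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long firstTs = 0L;
        long lastTs = 0L;
        int count = 0;
        while (sorted.hasNext()) {
            Element e = sorted.next();
            if (!e.key.equals(currentKey)) {
                // Key boundary: flush the previous key's summary.
                if (currentKey != null) {
                    out.add(currentKey + ":" + firstTs + ".." + lastTs + ":" + count);
                }
                currentKey = e.key;
                firstTs = e.timestamp;
                count = 0;
            }
            lastTs = e.timestamp;
            count++;
        }
        if (currentKey != null) {
            out.add(currentKey + ":" + firstTs + ".." + lastTs + ":" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Element> input = List.of(
            new Element("a", 3L, "x"),
            new Element("b", 1L, "y"),
            new Element("a", 1L, "z"),
            new Element("a", 2L, "w"));
        for (List<Element> partition : repartitionAndSort(input, 2)) {
            System.out.println(summarizePerKey(partition.iterator()));
        }
    }
}
```

In this sketch the sort happens in memory for simplicity; in Spark the sort is part of the shuffle and can spill to disk, which is what removes the "all values in memory" requirement.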
For non-merging windows, we can put the window itself into the key, resulting
in smaller groupings.
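A rough sketch of the composite-key idea, again in plain Java and assuming fixed windows of a given size (the class {{WindowedKeySketch}} and its helpers are hypothetical, not Beam API). Because a fixed window can be computed from the element's timestamp alone, it can be folded into the shuffle key up front; merging windows such as sessions depend on neighbouring elements, so this does not apply to them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: groups by (key, window) instead of by key alone.
final class WindowedKeySketch {

    /** Start of the fixed window of length windowSize that contains timestamp. */
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - Math.floorMod(timestamp, windowSize);
    }

    /** Composite shuffle key "key@windowStart". */
    static String compositeKey(String key, long timestamp, long windowSize) {
        return key + "@" + windowStart(timestamp, windowSize);
    }

    /** Groups timestamps by composite key; each group covers one key in one window. */
    static Map<String, List<Long>> groupByKeyAndWindow(
            String[] keys, long[] timestamps, long windowSize) {
        Map<String, List<Long>> groups = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String composite = compositeKey(keys[i], timestamps[i], windowSize);
            groups.computeIfAbsent(composite, k -> new ArrayList<>()).add(timestamps[i]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "a", "b"};
        long[] timestamps = {1L, 12L, 5L, 3L};
        // Key "a" is split into two smaller groups, one per window.
        System.out.println(groupByKeyAndWindow(keys, timestamps, 10L));
    }
}
```

With the window in the key, a hot key that spans many windows is spread over many smaller groups instead of one large one.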
This approach has already been tested at ~100 TB input scale on Spark 2.3.x
(with a custom Spark runner).
I'll submit a patch once the Dataflow Worker code donation is complete.