[ https://issues.apache.org/jira/browse/BEAM-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maximilian Michels updated BEAM-8848: ------------------------------------- Status: Open (was: Triage Needed) > Flink: Memory efficient GBK implementation for batch runner > ----------------------------------------------------------- > > Key: BEAM-8848 > URL: https://issues.apache.org/jira/browse/BEAM-8848 > Project: Beam > Issue Type: Improvement > Components: runner-flink > Reporter: David Moravek > Assignee: David Moravek > Priority: Major > > In current batch runner, all the values for a single key need to fit in > memory, because the resulting GBK iterable is materialized using "List" data > structure. > Implications: > - This blocks user from using custom sharding in most of the IOs, as the > whole shard needs to fit in memory. > - Frequent OOM failures in case of skewed data (pipeline should be running > slow instead of failing). This is super hard to debug for inexperienced user. > We can do way better for non-merging windows, the same way we do for Spark > runner. Only drawback is, that this implementation does not support result > re-iterations. We'll support turning this implementation on and off, if user > needs to trade off reiterations for memory efficiency. -- This message was sent by Atlassian Jira (v8.3.4#803005)