David Moravek created BEAM-8848:
-----------------------------------

             Summary: Flink: Memory efficient GBK implementation for batch runner
                 Key: BEAM-8848
                 URL: https://issues.apache.org/jira/browse/BEAM-8848
             Project: Beam
          Issue Type: Improvement
          Components: runner-flink
            Reporter: David Moravek
            Assignee: David Moravek


In the current batch runner, all the values for a single key need to fit in
memory, because the resulting GBK (GroupByKey) iterable is materialized using
a "List" data structure.

Implications:
- This blocks users from using custom sharding in most of the IOs, as the whole 
shard needs to fit in memory.
- Skewed data causes frequent OOM failures (the pipeline should run slowly 
instead of failing). This is very hard to debug for an inexperienced user.

We can do much better for non-merging windows, the same way we do for the Spark 
runner. The only drawback is that this implementation does not support 
re-iterating the result. We'll support turning this implementation on and off, 
in case the user needs to trade off re-iteration for memory efficiency.
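
To illustrate the trade-off, here is a minimal sketch of a grouped-values 
iterable that can be consumed only once. It is not the actual Beam or Flink 
code: the class name SinglePassIterable and its behavior on a second pass 
are assumptions made for illustration. The idea is that values are pulled 
lazily from an underlying iterator instead of being copied into a "List", 
so memory use stays bounded, at the cost of re-iteration.

```java
import java.util.Iterator;

/**
 * Hypothetical illustration only; not part of the Beam or Flink API.
 * Wraps an iterator over the values of one key and hands it out once,
 * so the values never need to be held in memory all at the same time.
 */
final class SinglePassIterable<T> implements Iterable<T> {

  private final Iterator<T> source;
  private boolean consumed = false;

  SinglePassIterable(Iterator<T> source) {
    this.source = source;
  }

  @Override
  public Iterator<T> iterator() {
    // A second pass would require buffering all values, which is
    // exactly what this sketch avoids, so we fail loudly instead.
    if (consumed) {
      throw new IllegalStateException(
          "Values for this key can be iterated only once");
    }
    consumed = true;
    return source;
  }
}
```

A user function that loops over the values once works unchanged with such 
an iterable. One that loops twice would need the "List"-based 
implementation, which is why the behavior should be switchable.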



--
This message was sent by Atlassian Jira
(v8.3.4#803005)