Eugene Kirpichov created BEAM-3353:
--------------------------------------
Summary: Prohibit stacked GBKs with accumulating mode
Key: BEAM-3353
URL: https://issues.apache.org/jira/browse/BEAM-3353
Project: Beam
Issue Type: Bug
Components: sdk-java-core, sdk-py-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov
The following test https://github.com/apache/beam/pull/4239 demonstrates that
stacked GBKs with accumulating mode are unsafe, the same way that stacked GBKs
with merging windows are unsafe.
In particular, in the pipeline: input -> (gbk onto N keys) -> ungroup -> (gbk
onto 1 key) -> ungroup, e.g. suppose the first gbk receives "a" and then "b";
it will emit "a" and then "a","b" - then the second gbk will emit "a" and then
"a","a","b" which is meaningless. With combine instead of GBK, it leads to
double-counting.
There are cases where accumulation propagated through stacked aggregation can
be desirable, but having it propagate by default is definitely the wrong thing
to do. Silently changing it to discarding is likely also the wrong thing to do.
So, we should reset the windowing strategy and force the user to specify
accumulating mode explicitly if they would like to.
All pipelines using this currently are computing meaningless results, so
rejecting them should not be considered a breaking change. However, we should
still find out whether there are a lot of such pipelines or not.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)