Eugene Kirpichov created BEAM-3353:
--------------------------------------

             Summary: Prohibit stacked GBKs with accumulating mode
                 Key: BEAM-3353
                 URL: https://issues.apache.org/jira/browse/BEAM-3353
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core, sdk-py-core
            Reporter: Eugene Kirpichov
            Assignee: Eugene Kirpichov


The test in https://github.com/apache/beam/pull/4239 demonstrates that stacked 
GBKs with accumulating mode are unsafe, in the same way that stacked GBKs with 
merging windows are unsafe.

In particular, consider the pipeline: input -> (gbk onto N keys) -> ungroup -> 
(gbk onto 1 key) -> ungroup. Suppose the first gbk receives "a" and then "b"; 
it will emit "a" and then "a","b". The second gbk will then emit "a" and then 
"a","a","b", which is meaningless. With combine instead of GBK, this leads to 
double-counting.
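
The scenario above can be reproduced with a small pure-Python simulation. This 
is only a sketch and does not use the Beam API: accumulating_gbk and ungroup 
are hypothetical helpers, and the assumption that each gbk fires once per 
incoming bundle is made purely for illustration.

```python
from collections import defaultdict
from itertools import accumulate


def accumulating_gbk(bundles, key_fn):
    """Toy GroupByKey that fires once per input bundle in accumulating mode.

    Each firing re-emits everything seen so far for every key the bundle touched.
    """
    state = defaultdict(list)
    panes = []
    for bundle in bundles:
        touched = []
        for value in bundle:
            key = key_fn(value)
            state[key].append(value)
            if key not in touched:
                touched.append(key)
        for key in touched:
            panes.append((key, list(state[key])))
    return panes


def ungroup(panes):
    """Drop the keys; each upstream pane becomes one downstream bundle."""
    return [list(values) for _, values in panes]


# First gbk receives "a" and then "b" on the same (hypothetical) key "k".
first = accumulating_gbk([["a"], ["b"]], key_fn=lambda v: "k")
# Second gbk groups everything onto a single key.
second = accumulating_gbk(ungroup(first), key_fn=lambda v: "all")
print(first)   # [('k', ['a']), ('k', ['a', 'b'])]
print(second)  # [('all', ['a']), ('all', ['a', 'a', 'b'])]

# Same effect with a counting combine: the first stage emits running counts,
# and an accumulating second stage sums every pane it has seen so far.
first_counts = [len(values) for _, values in first]  # [1, 2]
second_totals = list(accumulate(first_counts))       # [1, 3], yet only 2 elements exist
print(first_counts, second_totals)
```

The final total of 3 for two input elements is the double-counting mentioned 
above.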

There are cases where propagating accumulation through stacked aggregations is 
desirable, but propagating it by default is definitely the wrong thing to do. 
Silently changing the mode to discarding is likely also the wrong thing to do. 
So, we should reset the windowing strategy and force the user to specify 
accumulating mode explicitly if they want it.
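
One way to picture the proposed rule is the following pure-Python sketch. It is 
not Beam code: WindowingStrategy, apply_gbk and window_into are hypothetical 
stand-ins, and passing discarding mode through unchanged is an assumption the 
issue does not state.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class WindowingStrategy:
    # None means "reset": the user must choose a mode before aggregating again.
    accumulation_mode: Optional[str]


def apply_gbk(strategy):
    """Validate the input strategy and return the strategy of the GBK output."""
    if strategy.accumulation_mode is None:
        raise ValueError(
            "accumulation mode must be specified explicitly before a stacked GBK")
    if strategy.accumulation_mode == "ACCUMULATING":
        # Do not let accumulation propagate downstream by default.
        return replace(strategy, accumulation_mode=None)
    # Assumption: discarding mode is safe to propagate as-is.
    return strategy


def window_into(strategy, accumulation_mode):
    """Stand-in for the user re-specifying the mode between aggregations."""
    return replace(strategy, accumulation_mode=accumulation_mode)


after_first = apply_gbk(WindowingStrategy("ACCUMULATING"))

# A second GBK applied directly is rejected.
try:
    apply_gbk(after_first)
    rejected = False
except ValueError:
    rejected = True

# A second GBK is accepted once the user opts in explicitly.
explicit = apply_gbk(window_into(after_first, "ACCUMULATING"))
```

Under this sketch, a stacked GBK fails at construction time unless the user 
opts back into accumulating mode between the two aggregations.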

All pipelines that currently rely on this behavior are computing meaningless 
results, so rejecting them should not be considered a breaking change. However, 
we should still find out how many such pipelines exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
