[
https://issues.apache.org/jira/browse/BEAM-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293058#comment-16293058
]
Eugene Kirpichov commented on BEAM-3353:
----------------------------------------
An analysis of existing Dataflow job graphs involving this pattern has shown
that a majority of them are safe in practice, because they use one of the
following:
- the WriteFiles transform, where duplicate firings would lead only to
duplicate attempts to rename temp files that were already renamed (no-op)
- Combine with hot key fanout, which resets the windowing strategy explicitly
instead of using a Window.configure() transform
- Some use the AfterWatermark trigger with an allowed lateness of 0, which is
_the only_ trigger that is safe to use with stacked GBKs in accumulating mode.
This is very rare - usually people specify allowed lateness, which makes it
unsafe.
Others (a very small number) are unsafe, and it would do well to the customers
to prohibit this. However before we can do that, we need to fix WriteFiles and
Combine with hot key fanout transforms to not cause a violation of this rule.
Luckily WriteFiles has already been fixed (it no longer involves a chained
GroupByKey). Remains just to fix Combine, and then prohibit this pattern.
> Prohibit stacked GBKs with accumulating mode
> --------------------------------------------
>
> Key: BEAM-3353
> URL: https://issues.apache.org/jira/browse/BEAM-3353
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core, sdk-py-core
> Reporter: Eugene Kirpichov
> Assignee: Eugene Kirpichov
>
> The following test https://github.com/apache/beam/pull/4239 demonstrates that
> stacked GBKs with accumulating mode are unsafe, the same way that stacked
> GBKs with merging windows are unsafe.
> In particular, in the pipeline: input -> (gbk onto N keys) -> ungroup -> (gbk
> onto 1 key) -> ungroup, e.g. suppose the first gbk receives "a" and then "b";
> it will emit "a" and then "a","b" - then the second gbk will emit "a" and
> then "a","a","b" which is meaningless. With combine instead of GBK, it leads
> to double-counting.
> There are cases where accumulation propagated through stacked aggregation can
> be desirable, but having it propagate by default is definitely the wrong
> thing to do. Silently changing it to discarding is likely also the wrong
> thing to do. So, we should reset the windowing strategy and force the user to
> specify accumulating mode explicitly if they would like to.
> All pipelines using this currently are computing meaningless results, so
> rejecting them should not be considered a breaking change. However, we should
> still find out whether there are a lot of such pipelines or not.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)