[ https://issues.apache.org/jira/browse/BEAM-13628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Burke updated BEAM-13628:
--------------------------------
Description:
It's been determined that the documentation in the proto was a bit buggy WRT
side input semantics. Prior to https://github.com/apache/beam/pull/16474 it
said state cache tokens are globally unique; however, in the implementation
and the original design they are unique WRT their associated StateKeys.
This means the Go SDK's side input cache is broken as delivered, and can cause
a correctness issue when there are multiple distinct side inputs of the same
type. The mitigation is to not use the side input cache in affected versions
(2.35.0). The cache is off by default.
The correction will use the whole state key (which, for side inputs, includes
the (transformID, SideInputID) tuple, plus a user key if it's a multimap side
input), along with the Runner-provided token.
Because the worst case is a data correctness issue rather than a pipeline
failure, the fix should be part of the 2.36.0 release. We may also wish to
backport it to a 2.35.1 patch release, for the Go SDK only, to close the gap
there as well.
was:
It's been determined that the documentation in the proto was a bit buggy WRT
side input semantics. Prior to https://github.com/apache/beam/pull/16474 it
said state cache tokens are globally unique; however, in the implementation
and the original design they are unique WRT their associated StateKeys.
This means the Go SDK's side input cache is broken as delivered, and can cause
a correctness issue when there are multiple distinct side inputs.
The correction will use the whole state key (which, for side inputs, includes
the (transformID, SideInputID) tuple, plus a user key if it's a multimap side
input), along with the Runner-provided token.
> [Go SDK] Make Side input cache fit resolved semantics.
> ------------------------------------------------------
>
> Key: BEAM-13628
> URL: https://issues.apache.org/jira/browse/BEAM-13628
> Project: Beam
> Issue Type: Bug
> Components: sdk-go
> Affects Versions: 2.35.0
> Reporter: Robert Burke
> Assignee: Jack McCluskey
> Priority: P2
> Fix For: 2.36.0
>
>
> It's been determined that the documentation in the proto was a bit buggy WRT
> side input semantics. Prior to https://github.com/apache/beam/pull/16474 it
> said state cache tokens are globally unique; however, in the implementation
> and the original design they are unique WRT their associated StateKeys.
> This means the Go SDK's side input cache is broken as delivered, and can
> cause a correctness issue when there are multiple distinct side inputs of
> the same type. The mitigation is to not use the side input cache in affected
> versions (2.35.0). The cache is off by default.
> The correction will use the whole state key (which, for side inputs, includes
> the (transformID, SideInputID) tuple, plus a user key if it's a multimap side
> input), along with the Runner-provided token.
> Because the worst case is a data correctness issue rather than a pipeline
> failure, the fix should be part of the 2.36.0 release. We may also wish to
> backport it to a 2.35.1 patch release, for the Go SDK only, to close the gap
> there as well.
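
To illustrate the correction described in the issue, here is a minimal sketch of a cache keyed on the whole state key plus the Runner-provided token. It is not the Go SDK's actual implementation: the names cacheKey, sideInputCache, put, and get are hypothetical, and cached values are simplified to string slices. It only shows why two distinct side inputs that share a token no longer collide.

```go
package main

import "fmt"

// cacheKey is a hypothetical composite key: the full side input state key
// (transform ID, side input ID, and a user key for multimap side inputs)
// combined with the runner-provided cache token.
type cacheKey struct {
	transformID string
	sideInputID string
	userKey     string // assumed empty for non-multimap side inputs
	token       string
}

// sideInputCache is a simplified stand-in for a side input cache.
type sideInputCache struct {
	entries map[cacheKey][]string
}

func newSideInputCache() *sideInputCache {
	return &sideInputCache{entries: make(map[cacheKey][]string)}
}

// put stores values under the composite key.
func (c *sideInputCache) put(k cacheKey, v []string) {
	c.entries[k] = v
}

// get returns the cached values for the composite key, if present.
func (c *sideInputCache) get(k cacheKey) ([]string, bool) {
	v, ok := c.entries[k]
	return v, ok
}

func main() {
	c := newSideInputCache()

	// Two distinct side inputs on the same transform, sharing one token.
	a := cacheKey{transformID: "ParDo1", sideInputID: "i0", token: "tok"}
	b := cacheKey{transformID: "ParDo1", sideInputID: "i1", token: "tok"}

	c.put(a, []string{"from-i0"})

	_, hitA := c.get(a)
	_, hitB := c.get(b)
	// With a token-only key, the lookup for b would wrongly return a's data.
	fmt.Println(hitA, hitB) // true false
}
```

A struct of strings is comparable in Go, so it can be used directly as a map key without building a concatenated string.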