lostluck opened a new issue, #32245: URL: https://github.com/apache/beam/issues/32245
### What happened? When using the Beam State API in the Go SDK, it's possible that data may become corrupted when performing a Write. A change to the Go compiler for Go 1.23+ made the issue more frequent/severe/reproducible. It is *not* recommended to use Go 1.23+ for Beam Go pipelines using the State API until this issue is resolved. If your job doesn't use the State API, then it is unaffected. It's not known if the behavior can cause a similar issue elsewhere in the SDK at this time. Other calls such as writing data to data sinks are first written to a buffer, or properly copied to a buffer, avoiding the same problem, or at least rendering it less likely. ------- In particular, the data written to the state API, might become *overwritten or zeroed out* before actually being sent over the GRPC channel, corrupting the write. Consider a length prefix byte that is zeroed, or changed prior to being sent to the runner. --------- The root cause of the issue is that Beam Go is incorrectly implementing go's https://pkg.go.dev/io#Writer interface for writing to State calls. In particular, it's incorrectly retaining a reference to the input byte buffer, passed into the write call. https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L481 The buffer is simply passed to the FnAPI StateAPI write request, then being sent to another channel for actually sending it on the State channel. https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L498 This send happens asynchronously, sending it to a different goroutine for serialization. https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L704 While this is ultimately a blocking call, the other half is due to Beam Go lightly subverting Go's escape analysis in the name of performance, using [ioutilx.WriteUnsafe](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/util/ioutilx/write.go#L27). eg used here. https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/graph/coder/varint.go#L44 This was deemed required for performance for small writes in order to avoid usually unnecessary allocations to the heap. Typically these allocations would be stack allocated, and then not re-used. However in Go 1.23 a change to the compiler "can now overlap stack frame slots of local variables". It's my suspicion that this allowed the stack frame referred to to by the byte buffer to be zeroed out or re-assigned, corrupting the data. ------ The fix is to copy the bytes properly when sending them over the state API. This avoids the bytes being changed prior to being sent. A more robust fix would be to re-write coder handling entirely, to avoid using the io.Reader and io.Writer interfaces, which opened up the issue in the first place. ### Issue Priority Priority: 1 (data loss / total loss of function) ### Issue Components - [ ] Component: Python SDK - [ ] Component: Java SDK - [X] Component: Go SDK - [ ] Component: Typescript SDK - [ ] Component: IO connector - [ ] Component: Beam YAML - [ ] Component: Beam examples - [ ] Component: Beam playground - [ ] Component: Beam katas - [ ] Component: Website - [ ] Component: Infrastructure - [ ] Component: Spark Runner - [ ] Component: Flink Runner - [ ] Component: Samza Runner - [ ] Component: Twister2 Runner - [ ] Component: Hazelcast Jet Runner - [ ] Component: Google Cloud Dataflow Runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
