[ https://issues.apache.org/jira/browse/BEAM-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anonymous updated BEAM-11097: ----------------------------- Status: Triage Needed (was: Resolved) > [Go SDK] Windowed Side Input Caching (Cross Bundle) > --------------------------------------------------- > > Key: BEAM-11097 > URL: https://issues.apache.org/jira/browse/BEAM-11097 > Project: Beam > Issue Type: Improvement > Components: sdk-go > Reporter: Robert Burke > Assignee: Jack McCluskey > Priority: P3 > Labels: starter > Fix For: 2.35.0 > > Time Spent: 17h > Remaining Estimate: 0h > > This is implementing https://issues.apache.org/jira/browse/BEAM-5428 for the > Go SDK. > Side inputs are valid for given window. That means they can be and correctly > cached and re-used in multiple bundles if the data has already been loaded > into a given worker. > Side input data for a given bundle is specified and keyed by the window and a > a side input request token. The Go SDK presently doesn't use this information > for a SDK side worker cache of data. > Care needs to be taken to allow the data to be garbage collected if it hasn't > been used in the last minute or so. In practice, Global Window side inputs > will not be evicted while still in use, but infrequently accessed data will > possibly be evicted more regularly. A limit should be set to avoid running > out of memory. > As presently implemented, each ProcessElement call (or StartBundle or > FinishBundle) will only query for side input data on user request. However, > results of user requests are not cached between elements or bundles, which > means there's additional lookup and serialization time from the runner side. > It may be possible to re-use allocated side input elements, but even a > solution that re-decodes data on every call would be an improvement if the > data isn't re-streamed from the runner. > If https://issues.apache.org/jira/browse/BEAM-3293 has been implemented, then > per-key+window caching should additionally be considered for more granular > caching of data. > This optimization has an outsized impact on streaming performance as side > input data can change per bundle, and streaming bundles are typically small, > and quickly processed. > This work would be implemented in the exec package: > > [https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/sideinput.go] > > and probably the harness package > > [https://github.com/apache/beam/blob/c185adb7652034b5d044aae65a1a972d9ceb4377/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L50] -- This message was sent by Atlassian Jira (v8.20.10#820010)