[ 
https://issues.apache.org/jira/browse/BEAM-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anonymous updated BEAM-11097:
-----------------------------
    Status: Triage Needed  (was: Resolved)

> [Go SDK] Windowed Side Input Caching (Cross Bundle)
> ---------------------------------------------------
>
>                 Key: BEAM-11097
>                 URL: https://issues.apache.org/jira/browse/BEAM-11097
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-go
>            Reporter: Robert Burke
>            Assignee: Jack McCluskey
>            Priority: P3
>              Labels: starter
>             Fix For: 2.35.0
>
>          Time Spent: 17h
>  Remaining Estimate: 0h
>
> This is implementing https://issues.apache.org/jira/browse/BEAM-5428 for the 
> Go SDK.
> Side inputs are valid for given window. That means they can be and correctly 
> cached and re-used in multiple bundles if the data has already been loaded 
> into a given worker.
> Side input data for a given bundle is specified and keyed by the window and a 
> a side input request token. The Go SDK presently doesn't use this information 
> for a SDK side worker cache of data.
> Care needs to be taken to allow the data to be garbage collected if it hasn't 
> been used in the last minute or so. In practice, Global Window side inputs 
> will not be evicted while still in use, but infrequently accessed data will 
> possibly be evicted more regularly. A limit should be set to avoid running 
> out of memory.
> As presently implemented, each ProcessElement call (or StartBundle or 
> FinishBundle) will only query for side input data on user request. However, 
> results of user requests are not cached between elements or bundles, which 
> means there's additional lookup and serialization time from the runner side.
> It may be possible to re-use allocated side input elements, but even a 
> solution that re-decodes data on every call would be an improvement if the 
> data isn't re-streamed from the runner.
> If https://issues.apache.org/jira/browse/BEAM-3293 has been implemented, then 
> per-key+window caching should additionally be considered for more granular 
> caching of data.
> This optimization has an outsized impact on streaming performance as side 
> input data can change per bundle, and streaming bundles are typically small, 
> and quickly processed.
> This work would be implemented in the exec package: 
>  
> [https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/sideinput.go]
>  
>  and probably the harness package
>  
> [https://github.com/apache/beam/blob/c185adb7652034b5d044aae65a1a972d9ceb4377/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L50]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to