[ 
https://issues.apache.org/jira/browse/BEAM-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Burke updated BEAM-11097:
--------------------------------
    Summary: [Go SDK] Windowed Side Input Caching (Cross Bundle)  (was: [Go 
SDK] Windowed Side Input Caching)

> [Go SDK] Windowed Side Input Caching (Cross Bundle)
> ---------------------------------------------------
>
>                 Key: BEAM-11097
>                 URL: https://issues.apache.org/jira/browse/BEAM-11097
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-go
>            Reporter: Robert Burke
>            Assignee: Jack McCluskey
>            Priority: P3
>              Labels: starter
>
> Side inputs are valid for given window. That means they can be  and correctly 
> cached and re-used in multiple bundles if the data has already been loaded 
> into a given worker. 
> Side input data for a given bundle is specified and keyed by the window and a 
> a side input request token. The Go SDK presently doesn't use this information 
> for a SDK side worker cache of data.
> Care needs to be taken to allow the data to be garbage collected if it hasn't 
> been used in the last minute or so. In practice, Global Window side inputs 
> will not be evicted while still in use, but infrequently accessed data will 
> possibly be evicted more regularly. A limit should be set to avoid running 
> out of memory.
> As presently implemented, each ProcessElement call (or StartBundle or 
> FinishBundle) will only query for side input data on user request. However, 
> results of user requests are not cached between elements or bundles, which 
> means there's additional lookup and serialization time from the runner side.
> It may be possible to re-use allocated side input elements, but even a 
> solution that re-decodes data on every call would be an improvement if the 
> data isn't re-streamed from the runner.
> If https://issues.apache.org/jira/browse/BEAM-3293 has been implemented, then 
> per-key+window caching should additionally be considered for more granular 
> caching of data.
> This optimization has an outsized impact on streaming performance as side 
> input data can change per bundle, and streaming bundles are typically small, 
> and quickly processed.
> This work would be implemented in the exec package: 
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/sideinput.go
>  
> and probably the harness package
> https://github.com/apache/beam/blob/c185adb7652034b5d044aae65a1a972d9ceb4377/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L50



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to