This is documented at
https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
. Note that it requires participation of both the runner and the SDK
(though there are no correctness issues if one or the other side does
not understand the protocol, caching just won't be used).

I don't think it's been implemented anywhere, but could be very
beneficial for performance.

On Wed, Jul 17, 2019 at 6:00 PM Rakesh Kumar <rakeshku...@lyft.com> wrote:
>
> I checked the python sdk[1] and it has similar implementation as Java SDK.
>
> I would agree with Thomas. In case of high volume event stream and bigger 
> cluster size, network call can potentially cause a bottleneck.
>
> @Robert
> I am interested to see the proposal. Can you provide me the link of the 
> proposal?
>
> [1]: 
> https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295
>
>
> On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise <t...@apache.org> wrote:
>>
>> Thanks for the pointer. For streaming, it will be important to support 
>> caching across bundles. It appears that even the Java SDK doesn't support 
>> that yet?
>>
>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
>>
>> Regarding clear/append: It would be nice if both could occur within a single 
>> Fn Api roundtrip when the state is persisted.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> User state is built on top of read, append and clear and not off a read and 
>>> write paradigm to allow for blind appends.
>>>
>>> The optimization you speak of can be done completely inside the SDK without 
>>> any additional protocol being required as long as you clear the state first 
>>> and then append all your new data. The Beam Java SDK does this for all 
>>> runners when executed portably[1]. You could port the same logic to the 
>>> Beam Python SDK as well.
>>>
>>> 1: 
>>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
>>>
>>> On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> Python workers also have a per-bundle SDK-side cache. A protocol has
>>>> been proposed, but hasn't yet been implemented in any SDKs or runners.
>>>>
>>>> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax <re...@google.com> wrote:
>>>> >
>>>> > It's runner dependent. Some runners (e.g. the Dataflow runner) do have 
>>>> > such a cache, though I think it's currently has a cap for large bags.
>>>> >
>>>> > Reuven
>>>> >
>>>> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar <rakeshku...@lyft.com> 
>>>> > wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> I have been using python sdk for the application and also using 
>>>> >> BagState in production. I was wondering whether state logic has any 
>>>> >> write-through-cache implemented or not. If we are sending every read 
>>>> >> and write request through network then it comes with a performance 
>>>> >> cost. We can avoid network call for a read operation if we have 
>>>> >> write-through-cache.
>>>> >> I have superficially looked into the implementation and I didn't see 
>>>> >> any cache implementation.
>>>> >>
>>>> >> is it possible to have this cache? would it cause any issue if we have 
>>>> >> the caching layer?
>>>> >>

Reply via email to