I checked the Python SDK [1], and its implementation is similar to the
Java SDK's.

I agree with Thomas. With a high-volume event stream and a larger
cluster, the network calls can potentially become a bottleneck.

@Robert
I am interested in seeing the proposal. Could you send me a link to it?
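
To make the clear-then-append idea from Lukasz's message (quoted below)
concrete, here is a rough sketch of a per-bundle cache sitting on top of
read/append/clear. This is only an illustration under my own assumptions:
FakeStateClient and CachedBagState are hypothetical names, not Beam
classes, and the real SDK code may be structured quite differently.

```python
class FakeStateClient:
    """Hypothetical stand-in for the runner-side state API.

    Counts calls so the number of simulated network round trips is visible.
    """

    def __init__(self):
        self.store = {}
        self.calls = 0

    def read(self, key):
        self.calls += 1
        return list(self.store.get(key, []))

    def append(self, key, values):
        self.calls += 1
        self.store.setdefault(key, []).extend(values)

    def clear(self, key):
        self.calls += 1
        self.store.pop(key, None)


class CachedBagState:
    """Per-bundle cache over read/append/clear (a sketch, not Beam's API)."""

    def __init__(self, client, key):
        self._client = client
        self._key = key
        self._persisted = None  # runner-side values, fetched lazily
        self._pending = []      # values added during this bundle
        self._cleared = False   # whether a clear must be sent on commit

    def read(self):
        # Only the first read of a bundle goes to the runner.
        if self._persisted is None:
            self._persisted = self._client.read(self._key)
        return self._persisted + self._pending

    def add(self, value):
        # Blind append: no read is needed to add a value.
        self._pending.append(value)

    def clear(self):
        # Nothing persisted is visible any more, so no read is needed either.
        self._cleared = True
        self._persisted = []
        self._pending = []

    def commit(self):
        # End of bundle: clear first, then append all new data.
        if self._cleared:
            self._client.clear(self._key)
        if self._pending:
            self._client.append(self._key, self._pending)
        if self._persisted is not None:
            self._persisted = self._persisted + self._pending
        self._pending = []
        self._cleared = False
```

In this sketch, repeated reads within a bundle cost at most one call, and
a clear followed by appends costs two calls at commit time. It does not
cover caching across bundles or combining clear and append into a single
round trip, which are the points Thomas raised.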

[1]:
https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295


On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise <t...@apache.org> wrote:

> Thanks for the pointer. For streaming, it will be important to support
> caching across bundles. It appears that even the Java SDK doesn't support
> that yet?
>
>
> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
>
> Regarding clear/append: It would be nice if both could occur within a
> single Fn Api roundtrip when the state is persisted.
>
> Thanks,
> Thomas
>
>
>
> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> User state is built on top of read, append and clear and not off a read
>> and write paradigm to allow for blind appends.
>>
>> The optimization you speak of can be done completely inside the SDK
>> without any additional protocol being required as long as you clear the
>> state first and then append all your new data. The Beam Java SDK does this
>> for all runners when executed portably[1]. You could port the same logic to
>> the Beam Python SDK as well.
>>
>> 1:
>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
>>
>> On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Python workers also have a per-bundle SDK-side cache. A protocol has
>>> been proposed, but hasn't yet been implemented in any SDKs or runners.
>>>
>>> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax <re...@google.com> wrote:
>>> >
>>> > It's runner dependent. Some runners (e.g. the Dataflow runner) do have
such a cache, though I think it currently has a cap for large bags.
>>> >
>>> > Reuven
>>> >
>>> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar <rakeshku...@lyft.com>
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have been using the Python SDK for my application and am also using
>>> BagState in production. I was wondering whether the state logic has a
>>> write-through cache implemented. If we send every read and
>>> write request over the network, it comes with a performance cost. We can
>>> avoid the network call for a read operation if we have a write-through cache.
>>> >> I have looked superficially at the implementation and didn't see
>>> any cache implementation.
>>> >>
>>> >> Is it possible to have this cache? Would it cause any issues if we
>>> have a caching layer?
>>> >>
>>>
>>
