Re: Write-through-cache in State logic

Rakesh Kumar Tue, 23 Jul 2019 21:21:23 -0700

Thanks Robert,

 I stumble on the jira that you have created some time ago
https://jira.apache.org/jira/browse/BEAM-5428


You also marked code where code changes are required:
https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L291
https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L465

I am willing to provide help to implement this. Let me know how I can help.



On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw <rober...@google.com> wrote:

> This is documented at
>
> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
> . Note that it requires participation of both the runner and the SDK
> (though there are no correctness issues if one or the other side does
> not understand the protocol, caching just won't be used).
>
> I don't think it's been implemented anywhere, but could be very
> beneficial for performance.
>
> On Wed, Jul 17, 2019 at 6:00 PM Rakesh Kumar <rakeshku...@lyft.com> wrote:
> >
> > I checked the python sdk[1] and it has similar implementation as Java
> SDK.
> >
> > I would agree with Thomas. In case of high volume event stream and
> bigger cluster size, network call can potentially cause a bottleneck.
> >
> > @Robert
> > I am interested to see the proposal. Can you provide me the link of the
> proposal?
> >
> > [1]:
> https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295
> >
> >
> > On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise <t...@apache.org> wrote:
> >>
> >> Thanks for the pointer. For streaming, it will be important to support
> caching across bundles. It appears that even the Java SDK doesn't support
> that yet?
> >>
> >>
> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
> >>
> >> Regarding clear/append: It would be nice if both could occur within a
> single Fn Api roundtrip when the state is persisted.
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik <lc...@google.com> wrote:
> >>>
> >>> User state is built on top of read, append and clear and not off a
> read and write paradigm to allow for blind appends.
> >>>
> >>> The optimization you speak of can be done completely inside the SDK
> without any additional protocol being required as long as you clear the
> state first and then append all your new data. The Beam Java SDK does this
> for all runners when executed portably[1]. You could port the same logic to
> the Beam Python SDK as well.
> >>>
> >>> 1:
> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
> >>>
> >>> On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>>
> >>>> Python workers also have a per-bundle SDK-side cache. A protocol has
> >>>> been proposed, but hasn't yet been implemented in any SDKs or runners.
> >>>>
> >>>> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax <re...@google.com> wrote:
> >>>> >
> >>>> > It's runner dependent. Some runners (e.g. the Dataflow runner) do
> have such a cache, though I think it's currently has a cap for large bags.
> >>>> >
> >>>> > Reuven
> >>>> >
> >>>> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar <rakeshku...@lyft.com>
> wrote:
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> I have been using python sdk for the application and also using
> BagState in production. I was wondering whether state logic has any
> write-through-cache implemented or not. If we are sending every read and
> write request through network then it comes with a performance cost. We can
> avoid network call for a read operation if we have write-through-cache.
> >>>> >> I have superficially looked into the implementation and I didn't
> see any cache implementation.
> >>>> >>
> >>>> >> is it possible to have this cache? would it cause any issue if we
> have the caching layer?
> >>>> >>
>

Re: Write-through-cache in State logic

Reply via email to