Re: Existing transactionality inconsistency in the Beam Java State API

Ben Chambers Thu, 24 May 2018 09:40:35 -0700

I think Kenn's second option accurately reflects my memory of the original
intentions:

1. I remember we considered either using the Future interface or calling
the ReadableState interface a future, and explicitly said "no, future
implies asynchrony and that the value returned by `get` won't change over
multiple calls, but we want the latest value each time". So, I remember us
explicitly considering and rejecting Future, thus the name "ReadableState".

2. The intuition behind the implementation was analogous to a
mutable-reference cell in languages like ML / Scheme / etc. The
ReadableState is just a pointer to the reference cell. Calling read
returns the value currently in the cell. If we have 100 ReadableStates
pointing at the same cell, they all get the same value regardless of when
they were created. This avoids needing to duplicate/snapshot values at any
point in time.

3. readLater() was added, as noted by Charles, to suggest prefetching the
associated value. This was added after benchmarks showed 10x (if I remember
correctly) performance improvements in things like GroupAlsoByWindows by
minimizing round-trips asking for more state. The intuition being -- if we
need to make an RPC to load one state value, we are better off making an
RPC to load all the values we need.
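
As a minimal sketch of the batching idea in point 3: the hint records which
values will be needed, and the first real read loads all of them in one round
trip. FakeBackend and BatchingStateCache are hypothetical names invented for
this illustration, not Beam classes, and a real runner may batch differently.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a remote state backend; it only counts round trips.
class FakeBackend {
    final Map<String, Integer> stored = new HashMap<>();
    int roundTrips = 0;

    Map<String, Integer> fetchAll(Set<String> keys) {
        roundTrips++; // one RPC, no matter how many keys are requested
        Map<String, Integer> result = new HashMap<>();
        for (String k : keys) {
            result.put(k, stored.get(k));
        }
        return result;
    }
}

// Hypothetical cache that batches every read hinted via readLater().
class BatchingStateCache {
    private final FakeBackend backend;
    private final Set<String> pending = new LinkedHashSet<>();
    private final Map<String, Integer> loaded = new HashMap<>();

    BatchingStateCache(FakeBackend backend) {
        this.backend = backend;
    }

    // Hint only: remembers that the key will be needed, performs no I/O.
    void readLater(String key) {
        if (!loaded.containsKey(key)) {
            pending.add(key);
        }
    }

    // Loads the requested key together with all hinted keys in one round trip.
    Integer read(String key) {
        if (!loaded.containsKey(key)) {
            pending.add(key);
            loaded.putAll(backend.fetchAll(pending));
            pending.clear();
        }
        return loaded.get(key);
    }
}

class BatchingDemo {
    public static void main(String[] args) {
        FakeBackend backend = new FakeBackend();
        backend.stored.put("count", 3);
        backend.stored.put("sum", 42);

        BatchingStateCache cache = new BatchingStateCache(backend);
        cache.readLater("count");
        cache.readLater("sum");

        // Both values arrive with the first read; the second read is a cache hit.
        System.out.println(cache.read("count") + " " + cache.read("sum")
            + " roundTrips=" + backend.roundTrips);
        // prints "3 42 roundTrips=1"
    }
}
```

Without the two readLater() calls, the same two reads would cost two round
trips in this sketch.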

Overall, I too lean towards maintaining the second interpretation since it
seems to be consistent and I believe we had additional reasons for
preferring it over futures.

Given the confusion, I think strengthening the class documentation makes
sense -- I note the only hint of the current behavior is that ReadableState
indicates it gets the *current* value (emphasis mine). We should emphasize
that and perhaps even mention that the ReadableState should be understood
as just a reference or handle to the underlying state, and thus its value
will reflect the latest write.
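
To make the "reference or handle" reading concrete, here is a small sketch
contrasting a handle with a future-like snapshot. Cell, Handle and CellHandle
are hypothetical names for this illustration; Handle is only a simplified
stand-in for ReadableState, and the snapshot uses the standard
CompletableFuture purely for contrast.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical mutable cell holding the current state value.
class Cell<T> {
    private T value;

    Cell(T initial) {
        this.value = initial;
    }

    void write(T newValue) {
        this.value = newValue;
    }

    T current() {
        return value;
    }
}

// Simplified stand-in for ReadableState: a handle, not a snapshot.
interface Handle<T> {
    T read();

    Handle<T> readLater();
}

class CellHandle<T> implements Handle<T> {
    private final Cell<T> cell;

    CellHandle(Cell<T> cell) {
        this.cell = cell;
    }

    // Always returns whatever is in the cell right now.
    @Override
    public T read() {
        return cell.current();
    }

    // Only a hint; it never changes what read() returns.
    @Override
    public Handle<T> readLater() {
        return this;
    }
}

class HandleDemo {
    public static void main(String[] args) {
        Cell<Integer> cell = new Cell<>(1);

        Handle<Integer> handle = new CellHandle<>(cell).readLater();
        // Future-style alternative: the value is frozen at creation time.
        CompletableFuture<Integer> snapshot =
            CompletableFuture.completedFuture(cell.current());

        cell.write(2);

        System.out.println("handle=" + handle.read() + " snapshot=" + snapshot.join());
        // prints "handle=2 snapshot=1"
    }
}
```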

Charles, if it helps, the plan I remember regarding prefetching was
something like:

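interface ReadableMapState<K, V> {
   // Handle to the value stored under a single key.
   ReadableState<V> get(K key);
   // Handle to all values in the map.
   ReadableState<Iterable<V>> getIterable();
   // Handle to the entire map.
   ReadableState<Map<K, V>> get();
   // ... more things ...
}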

Then prefetching a value is `mapState.get(key).readLater()` and prefetching
the entire map is `mapState.get().readLater()`, etc.
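
A rough, self-contained sketch of how that plan could behave, assuming an
in-memory backing map. The ReadableState and ReadableMapState interfaces below
are simplified local copies of the ones discussed in this thread;
InMemoryMapState and its prefetchLog field are hypothetical and exist only to
make the prefetch hints observable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Simplified local copy of the interface discussed in the thread.
interface ReadableState<T> {
    T read();

    ReadableState<T> readLater();
}

// The interface from the plan above, minus the elided members.
interface ReadableMapState<K, V> {
    ReadableState<V> get(K key);

    ReadableState<Iterable<V>> getIterable();

    ReadableState<Map<K, V>> get();
}

// Hypothetical in-memory implementation that logs prefetch hints.
class InMemoryMapState<K, V> implements ReadableMapState<K, V> {
    final Map<K, V> contents = new HashMap<>();
    final List<String> prefetchLog = new ArrayList<>();

    // Builds a view whose read() is evaluated at call time, not at creation time.
    private <T> ReadableState<T> view(String label, Supplier<T> reader) {
        return new ReadableState<T>() {
            @Override
            public T read() {
                return reader.get();
            }

            @Override
            public ReadableState<T> readLater() {
                prefetchLog.add(label); // a real runner could start a fetch here
                return this;
            }
        };
    }

    @Override
    public ReadableState<V> get(K key) {
        return view("key:" + key, () -> contents.get(key));
    }

    @Override
    public ReadableState<Iterable<V>> getIterable() {
        return view("values", () -> new ArrayList<V>(contents.values()));
    }

    @Override
    public ReadableState<Map<K, V>> get() {
        return view("map", () -> new HashMap<K, V>(contents));
    }
}

class MapStateDemo {
    public static void main(String[] args) {
        InMemoryMapState<String, Integer> mapState = new InMemoryMapState<>();
        mapState.contents.put("a", 1);

        ReadableState<Integer> a = mapState.get("a").readLater(); // prefetch one key
        mapState.get().readLater();                               // prefetch the whole map

        mapState.contents.put("a", 2); // write after the prefetch hints

        System.out.println(a.read() + " " + mapState.prefetchLog);
        // prints "2 [key:a, map]"
    }
}
```

The read still returns 2 even though the hint was issued while the value was
1, which matches the "latest value each time" interpretation above.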

On Wed, May 23, 2018 at 7:13 PM Charles Chen <[email protected]> wrote:

> Thanks Kenn.  I think there are two issues to highlight: (1) the API
> should allow for some sort of prefetching / batching / background I/O for
> state; and (2) it should be clear what the semantics are for reading (e.g.
> so we don't have confusing read after write behavior).
>
> The approach I'm leaning towards for (1) is to allow a state.prefetch()
> method (to prefetch a value, iterable or [entire] map state) and maybe
> something like state.prefetch_key(key) to prefetch a specific KV in the
> map.  Issue (2) seems to be okay in either of Kenn's positions.
>
> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <[email protected]>
> wrote:
>
>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>> second option, despite its drawbacks. (In particular, readLater should
>> not influence what's returned at read(), it's just a hint.)
>>
>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> Great idea to bring it to dev@. I think it is better to focus here than
>>> long doc comment threads.
>>>
>>> I had strong opinions that I think were a bit confused and wrong. Sorry
>>> for that. I stated this position:
>>>
>>>  - XYZState class is a handle to a mutable location
>>>  - its methods like isEmpty() or contents() should return immutable
>>> future values (implicitly means their contents are semantically frozen when
>>> they are created)
>>>  - the fact that you created the future is a hint that all necessary
>>> fetching/computation should be kicked off
>>>  - later forced with get()
>>>  - when it was designed, pure async style was not a viable option
>>>
>>> I see now that the actual position of some of its original designers is:
>>>
>>>  - XYZState class is a view on a mutable location
>>>  - its methods return new views on that mutable location
>>>  - calling readLater() is a hint that some fetching/computation should
>>> be kicked off
>>>  - later read() will combine whatever readLater() did with additional
>>> local info to give the current value
>>>  - async style not applicable nor desirable as per Beam's focus on naive
>>> straight-line coding + autoscaling
>>>
>>> These are both internally consistent I think. In fact, I like the second
>>> perspective better than the one I have been promoting. There are some
>>> weaknesses: readLater() is pretty tightly coupled to a particular
>>> implementation style, and futures are decades old so you can get good APIs
>>> and performance without inventing anything. But I still like the non-future
>>> version a little better.
>>>
>>> Kenn
>>>
>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <[email protected]> wrote:
>>>
>>>> During the design of the Beam Python State API, we noticed some
>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>> are the unresolved bugs BEAM-2980
>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>>>> having a discussion about this API which can have implications for its
>>>> future development in all Beam languages:
>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>
>>>> If you have an opinion on the possible design approaches, it would be
>>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>>
>>>> Best,
>>>> Charles
>>>>
>>>
