Re: Existing transactionality inconsistency in the Beam Java State API

Lukasz Cwik Wed, 06 Jun 2018 13:41:57 -0700

Sounds great and thanks for the conclusion summary.

On Tue, Jun 5, 2018 at 4:56 PM Charles Chen <[email protected]> wrote:


> Thanks everyone for commenting and contributing to the discussion.  There
> appears to be enough consensus on these points to start an initial
> implementation.  Specifically, I'd like to highlight from the doc (
> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
> ):
>
> *With respect to existing state data and transactionality: *We will go
> with presenting a consistent view of state data, in that after a write, a
> read will always return a value that reflects the result of that write.
> When a read returning an iterator object is done, an implicit snapshot of
> the underlying value at that point will be taken, and any subsequent
> mutation of the state will not change the values contained in, or
> invalidate, a previously returned iterator.  We will use state.prefetch()
> and state.prefetch_key(key) to suggest that the runner prefetch relevant
> data; this will supercede and replace the existing state.readLater()
> methods from the Java API, since the semantics of saying "prefetch" are
> much clearer to the user.
>
> *With respect to state for merging windows: *We will first implement proposed
> design 3
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u>
> by excluding an implementation of non-combinable state types like the
> non-combinable ValueState, since this is the most forwards-compatible
> option.  At a later point, we will either obtain community consensus to
> remove non-combinable ValueState from the Java SDK as well, or turn to 
> proposed
> design 1
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn>
> by implementing non-combinable state types in the Python SDK and reject
> jobs that use non-combinable state types with merging windows.
>
> Best,
> Charles
>
> On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <[email protected]> wrote:
>
>> Great, I was confused in the description that was provided and then the
>> follow up by Ben. I think its worthwhile to describe the differences with
>> actual examples of what happens.
>>
>> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> I think the return value of read() should always be an immutable value.
>>>
>>> Kenn
>>>
>>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Kenn, in the second example where we are creating views whenever read()
>>>> is called, is it that the view's underlying data is immutable. For example:
>>>> Iterable<String> values = state.read();
>>>> state.append("newValue");
>>>> If I iterate over values, does values now contain "newValues"?
>>>>
>>>>
>>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> I see what you mean but I don't agree that futures imply anything
>>>>> other than "it is a value that you have to force", with deliberately many
>>>>> possible implementations. When at the point in 1 and you've got
>>>>>
>>>>>     interface ReadableState<T> {
>>>>>         T read()
>>>>>     }
>>>>>
>>>>> and you want to improve performance, both approaches "void
>>>>> readLater()" and "StateFuture<T> read()" are natural evolutions. They both
>>>>> gain the same 10x and they both support the "unchanging committed state
>>>>> plus buffered mutations" implementation well. And snapshots are 
>>>>> essentially
>>>>> free for that implementation if the buffered mutations are stored in a
>>>>> decent data structure.
>>>>>
>>>>> My recollection was that futures were seen as more cumbersome. They
>>>>> affect the types even for simple uses. The only appealing future API was
>>>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>>>> for these to be used in complex ways, so the usability & perf benefits of 
>>>>> a
>>>>> future-based API weren't really realized anyhow.
>>>>>
>>>>> The only reason I belabor this is that if we ever wanted to support
>>>>> more complex use cases, such as concurrent use of state, then my 
>>>>> preference
>>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>>>> under the hood of a future-based API.
>>>>>
>>>>> Since there is really only one implementation in mind for this, maybe
>>>>> only one that works reasonably, we should just document it as such. The
>>>>> docs on ReadableState do make it sound like writes will invalidate the
>>>>> usefulness of readLater, even though that isn't the case for the intended
>>>>> implementation strategy.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>>> original intentions:
>>>>>>
>>>>>> 1. I remember we we considered either using the Future interface or
>>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>>> future implies asynchrony and that the value returned by `get` won't 
>>>>>> change
>>>>>> over multiple calls, but we want the latest value each time". So, I
>>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>>> "ReadableState".
>>>>>>
>>>>>> 2. The intuition behind the implementation was analogous to a
>>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>>> they were created. This avoids needing to duplicate/snapshot values at 
>>>>>> any
>>>>>> point in time.
>>>>>>
>>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>>> remember correctly) performance improvements in things like
>>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>>> are better off making an RPC to load all the values we need.
>>>>>>
>>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>>> since it seems to be consistent and I believe we had additional reasons 
>>>>>> for
>>>>>> preferring it over futures.
>>>>>>
>>>>>> Given the confusion, I think strengthening the class documentation
>>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>>> should be understood as just a reference or handle to the underlying 
>>>>>> state,
>>>>>> and thus its value will reflect the latest write.
>>>>>>
>>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>>> something like:
>>>>>>
>>>>>> interface ReadableMapState<K, V> {
>>>>>>    ReadableState<V> get(K key);
>>>>>>    ReadableState<Iterable<V>> getIterable();
>>>>>>    ReadableState<Map<K, V>> get();
>>>>>>    // ... more things ...
>>>>>> }
>>>>>>
>>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>>> should allow for some sort of prefetching / batching / background I/O 
>>>>>>> for
>>>>>>> state; and (2) it should be clear what the semantics are for reading 
>>>>>>> (e.g.
>>>>>>> so we don't have confusing read after write behavior).
>>>>>>>
>>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>>> positions.
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Great idea to bring it to dev@. I think it is better to focus
>>>>>>>>> here than long doc comment threads.
>>>>>>>>>
>>>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>>>> Sorry for that. I stated this position:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>>>  - its methods like isEmpty() or contents() should return
>>>>>>>>> immutable future values (implicitly means their contents are 
>>>>>>>>> semantically
>>>>>>>>> frozen when they are created)
>>>>>>>>>  - the fact that you created the future is a hint that all
>>>>>>>>> necessary fetching/computation should be kicked off
>>>>>>>>>  - later forced with get()
>>>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>>>
>>>>>>>>> I see now that the actual position of some of its original
>>>>>>>>> designers is:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>>>  - its methods return new views on that mutable location
>>>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>>>> should be kicked off
>>>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>>>> additional local info to give the current value
>>>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>>>> naive straight-line coding + autoscaling
>>>>>>>>>
>>>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>>>> second perspective better than the one I have been promoting. There 
>>>>>>>>> are
>>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>>>> implementation style, and futures are decades old so you can get good 
>>>>>>>>> APIs
>>>>>>>>> and performance without inventing anything. But I still like the 
>>>>>>>>> non-future
>>>>>>>>> version a little better.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API 
>>>>>>>>>> (these
>>>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>>>> therefore having a discussion about this API which can have 
>>>>>>>>>> implications
>>>>>>>>>> for its future development in all Beam languages:
>>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>>>
>>>>>>>>>> If you have an opinion on the possible design approaches, it
>>>>>>>>>> would be very helpful to bring up in the doc or on this thread.  
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Charles
>>>>>>>>>>
>>>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Reply via email to