Sounds great and thanks for the conclusion summary. On Tue, Jun 5, 2018 at 4:56 PM Charles Chen <c...@google.com> wrote:
> Thanks everyone for commenting and contributing to the discussion. There > appears to be enough consensus on these points to start an initial > implementation. Specifically, I'd like to highlight from the doc ( > https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b > ): > > *With respect to existing state data and transactionality: *We will go > with presenting a consistent view of state data, in that after a write, a > read will always return a value that reflects the result of that write. > When a read returning an iterator object is done, an implicit snapshot of > the underlying value at that point will be taken, and any subsequent > mutation of the state will not change the values contained in, or > invalidate, a previously returned iterator. We will use state.prefetch() > and state.prefetch_key(key) to suggest that the runner prefetch relevant > data; this will supercede and replace the existing state.readLater() > methods from the Java API, since the semantics of saying "prefetch" are > much clearer to the user. > > *With respect to state for merging windows: *We will first implement proposed > design 3 > <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u> > by excluding an implementation of non-combinable state types like the > non-combinable ValueState, since this is the most forwards-compatible > option. At a later point, we will either obtain community consensus to > remove non-combinable ValueState from the Java SDK as well, or turn to > proposed > design 1 > <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn> > by implementing non-combinable state types in the Python SDK and reject > jobs that use non-combinable state types with merging windows. > > Best, > Charles > > On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <lc...@google.com> wrote: > >> Great, I was confused in the description that was provided and then the >> follow up by Ben. I think its worthwhile to describe the differences with >> actual examples of what happens. >> >> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <k...@google.com> wrote: >> >>> I think the return value of read() should always be an immutable value. >>> >>> Kenn >>> >>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote: >>> >>>> Kenn, in the second example where we are creating views whenever read() >>>> is called, is it that the view's underlying data is immutable. For example: >>>> Iterable<String> values = state.read(); >>>> state.append("newValue"); >>>> If I iterate over values, does values now contain "newValues"? >>>> >>>> >>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <k...@google.com> >>>> wrote: >>>> >>>>> I see what you mean but I don't agree that futures imply anything >>>>> other than "it is a value that you have to force", with deliberately many >>>>> possible implementations. When at the point in 1 and you've got >>>>> >>>>> interface ReadableState<T> { >>>>> T read() >>>>> } >>>>> >>>>> and you want to improve performance, both approaches "void >>>>> readLater()" and "StateFuture<T> read()" are natural evolutions. They both >>>>> gain the same 10x and they both support the "unchanging committed state >>>>> plus buffered mutations" implementation well. And snapshots are >>>>> essentially >>>>> free for that implementation if the buffered mutations are stored in a >>>>> decent data structure. >>>>> >>>>> My recollection was that futures were seen as more cumbersome. They >>>>> affect the types even for simple uses. The only appealing future API was >>>>> Guava's, but we didn't want that on the API surface. And we did not intend >>>>> for these to be used in complex ways, so the usability & perf benefits of >>>>> a >>>>> future-based API weren't really realized anyhow. >>>>> >>>>> The only reason I belabor this is that if we ever wanted to support >>>>> more complex use cases, such as concurrent use of state, then my >>>>> preference >>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At >>>>> that point I'd favor using a snapshots-are-free concurrent data structure >>>>> under the hood of a future-based API. >>>>> >>>>> Since there is really only one implementation in mind for this, maybe >>>>> only one that works reasonably, we should just document it as such. The >>>>> docs on ReadableState do make it sound like writes will invalidate the >>>>> usefulness of readLater, even though that isn't the case for the intended >>>>> implementation strategy. >>>>> >>>>> Kenn >>>>> >>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bchamb...@apache.org> >>>>> wrote: >>>>> >>>>>> I think Kenn's second option accurately reflects my memory of the >>>>>> original intentions: >>>>>> >>>>>> 1. I remember we we considered either using the Future interface or >>>>>> calling the ReadableState interface a future, and explicitly said "no, >>>>>> future implies asynchrony and that the value returned by `get` won't >>>>>> change >>>>>> over multiple calls, but we want the latest value each time". So, I >>>>>> remember us explicitly considering and rejecting Future, thus the name >>>>>> "ReadableState". >>>>>> >>>>>> 2. The intuition behind the implementation was analogous to a >>>>>> mutable-reference cell in languages like ML / Scheme / etc. The >>>>>> ReadableState is just a pointer to the the reference cell. Calling read >>>>>> returns the value currently in the cell. If we have 100 ReadableStates >>>>>> pointing at the same cell, they all get the same value regardless of when >>>>>> they were created. This avoids needing to duplicate/snapshot values at >>>>>> any >>>>>> point in time. >>>>>> >>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching >>>>>> the associated value. This was added after benchmarks showed 10x (if I >>>>>> remember correctly) performance improvements in things like >>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The >>>>>> intuition being -- if we need to make an RPC to load one state value, we >>>>>> are better off making an RPC to load all the values we need. >>>>>> >>>>>> Overall, I too lean towards maintaining the second interpretation >>>>>> since it seems to be consistent and I believe we had additional reasons >>>>>> for >>>>>> preferring it over futures. >>>>>> >>>>>> Given the confusion, I think strengthening the class documentation >>>>>> makes sense -- I note the only hint of the current behavior is that >>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We >>>>>> should emphasize that and perhaps even mention that the ReadableState >>>>>> should be understood as just a reference or handle to the underlying >>>>>> state, >>>>>> and thus its value will reflect the latest write. >>>>>> >>>>>> Charles, if it helps, the plan I remember regarding prefetching was >>>>>> something like: >>>>>> >>>>>> interface ReadableMapState<K, V> { >>>>>> ReadableState<V> get(K key); >>>>>> ReadableState<Iterable<V>> getIterable(); >>>>>> ReadableState<Map<K, V>> get(); >>>>>> // ... more things ... >>>>>> } >>>>>> >>>>>> Then prefetching a value is `mapState.get(key).readLater()` and >>>>>> prefetching the entire map is `mapState.get().readLater()`, etc. >>>>>> >>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <c...@google.com> wrote: >>>>>> >>>>>>> Thanks Kenn. I think there are two issues to highlight: (1) the API >>>>>>> should allow for some sort of prefetching / batching / background I/O >>>>>>> for >>>>>>> state; and (2) it should be clear what the semantics are for reading >>>>>>> (e.g. >>>>>>> so we don't have confusing read after write behavior). >>>>>>> >>>>>>> The approach I'm leaning towards for (1) is to allow a >>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map >>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a >>>>>>> specific KV in the map. Issue (2) seems to be okay in either of Kenn's >>>>>>> positions. >>>>>>> >>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <rober...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards >>>>>>>> the second option, despite its drawbacks. (In particular, readLater >>>>>>>> should not influence what's returned at read(), it's just a hint.) >>>>>>>> >>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <k...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Great idea to bring it to dev@. I think it is better to focus >>>>>>>>> here than long doc comment threads. >>>>>>>>> >>>>>>>>> I had strong opinions that I think were a bit confused and wrong. >>>>>>>>> Sorry for that. I stated this position: >>>>>>>>> >>>>>>>>> - XYZState class is a handle to a mutable location >>>>>>>>> - its methods like isEmpty() or contents() should return >>>>>>>>> immutable future values (implicitly means their contents are >>>>>>>>> semantically >>>>>>>>> frozen when they are created) >>>>>>>>> - the fact that you created the future is a hint that all >>>>>>>>> necessary fetching/computation should be kicked off >>>>>>>>> - later forced with get() >>>>>>>>> - when it was designed, pure async style was not a viable option >>>>>>>>> >>>>>>>>> I see now that the actual position of some of its original >>>>>>>>> designers is: >>>>>>>>> >>>>>>>>> - XYZState class is a view on a mutable location >>>>>>>>> - its methods return new views on that mutable location >>>>>>>>> - calling readLater() is a hint that some fetching/computation >>>>>>>>> should be kicked off >>>>>>>>> - later read() will combine whatever readLater() did with >>>>>>>>> additional local info to give the current value >>>>>>>>> - async style not applicable nor desirable as per Beam's focus on >>>>>>>>> naive straight-line coding + autoscaling >>>>>>>>> >>>>>>>>> These are both internally consistent I think. In fact, I like the >>>>>>>>> second perspective better than the one I have been promoting. There >>>>>>>>> are >>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular >>>>>>>>> implementation style, and futures are decades old so you can get good >>>>>>>>> APIs >>>>>>>>> and performance without inventing anything. But I still like the >>>>>>>>> non-future >>>>>>>>> version a little better. >>>>>>>>> >>>>>>>>> Kenn >>>>>>>>> >>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <c...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> During the design of the Beam Python State API, we noticed some >>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API >>>>>>>>>> (these >>>>>>>>>> are the unresolved bugs BEAM-2980 >>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975 >>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>). We are >>>>>>>>>> therefore having a discussion about this API which can have >>>>>>>>>> implications >>>>>>>>>> for its future development in all Beam languages: >>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b >>>>>>>>>> >>>>>>>>>> If you have an opinion on the possible design approaches, it >>>>>>>>>> would be very helpful to bring up in the doc or on this thread. >>>>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Charles >>>>>>>>>> >>>>>>>>>