Kenn, in the second example where we are creating views whenever read() is called, is it that the view's underlying data is immutable. For example: Iterable<String> values = state.read(); state.append("newValue"); If I iterate over values, does values now contain "newValues"?
On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <k...@google.com> wrote: > I see what you mean but I don't agree that futures imply anything other > than "it is a value that you have to force", with deliberately many > possible implementations. When at the point in 1 and you've got > > interface ReadableState<T> { > T read() > } > > and you want to improve performance, both approaches "void readLater()" > and "StateFuture<T> read()" are natural evolutions. They both gain the same > 10x and they both support the "unchanging committed state plus buffered > mutations" implementation well. And snapshots are essentially free for that > implementation if the buffered mutations are stored in a decent data > structure. > > My recollection was that futures were seen as more cumbersome. They affect > the types even for simple uses. The only appealing future API was Guava's, > but we didn't want that on the API surface. And we did not intend for these > to be used in complex ways, so the usability & perf benefits of a > future-based API weren't really realized anyhow. > > The only reason I belabor this is that if we ever wanted to support more > complex use cases, such as concurrent use of state, then my preference > would flip. I wouldn't want to make XYZState a synchronized monitor. At > that point I'd favor using a snapshots-are-free concurrent data structure > under the hood of a future-based API. > > Since there is really only one implementation in mind for this, maybe only > one that works reasonably, we should just document it as such. The docs on > ReadableState do make it sound like writes will invalidate the usefulness > of readLater, even though that isn't the case for the intended > implementation strategy. > > Kenn > > On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bchamb...@apache.org> wrote: > >> I think Kenn's second option accurately reflects my memory of the >> original intentions: >> >> 1. I remember we we considered either using the Future interface or >> calling the ReadableState interface a future, and explicitly said "no, >> future implies asynchrony and that the value returned by `get` won't change >> over multiple calls, but we want the latest value each time". So, I >> remember us explicitly considering and rejecting Future, thus the name >> "ReadableState". >> >> 2. The intuition behind the implementation was analogous to a >> mutable-reference cell in languages like ML / Scheme / etc. The >> ReadableState is just a pointer to the the reference cell. Calling read >> returns the value currently in the cell. If we have 100 ReadableStates >> pointing at the same cell, they all get the same value regardless of when >> they were created. This avoids needing to duplicate/snapshot values at any >> point in time. >> >> 3. ReadLater was added, as noted by Charles, to suggest prefetching the >> associated value. This was added after benchmarks showed 10x (if I remember >> correctly) performance improvements in things like GroupAlsoByWindows by >> minimizing round-trips asking for more state. The intuition being -- if we >> need to make an RPC to load one state value, we are better off making an >> RPC to load all the values we need. >> >> Overall, I too lean towards maintaining the second interpretation since >> it seems to be consistent and I believe we had additional reasons for >> preferring it over futures. >> >> Given the confusion, I think strengthening the class documentation makes >> sense -- I note the only hint of the current behavior is that ReadableState >> indicates it gets the *current* value (emphasis mine). We should emphasize >> that and perhaps even mention that the ReadableState should be understood >> as just a reference or handle to the underlying state, and thus its value >> will reflect the latest write. >> >> Charles, if it helps, the plan I remember regarding prefetching was >> something like: >> >> interface ReadableMapState<K, V> { >> ReadableState<V> get(K key); >> ReadableState<Iterable<V>> getIterable(); >> ReadableState<Map<K, V>> get(); >> // ... more things ... >> } >> >> Then prefetching a value is `mapState.get(key).readLater()` and >> prefetching the entire map is `mapState.get().readLater()`, etc. >> >> On Wed, May 23, 2018 at 7:13 PM Charles Chen <c...@google.com> wrote: >> >>> Thanks Kenn. I think there are two issues to highlight: (1) the API >>> should allow for some sort of prefetching / batching / background I/O for >>> state; and (2) it should be clear what the semantics are for reading (e.g. >>> so we don't have confusing read after write behavior). >>> >>> The approach I'm leaning towards for (1) is to allow a state.prefetch() >>> method (to prefetch a value, iterable or [entire] map state) and maybe >>> something like state.prefetch_key(key) to prefetch a specific KV in the >>> map. Issue (2) seems to be okay in either of Kenn's positions. >>> >>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <rober...@google.com> >>> wrote: >>> >>>> Thanks for laying this out so well, Kenn. I'm also leaning towards the >>>> second option, despite its drawbacks. (In particular, readLater should >>>> not influence what's returned at read(), it's just a hint.) >>>> >>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <k...@google.com> wrote: >>>> >>>>> Great idea to bring it to dev@. I think it is better to focus here >>>>> than long doc comment threads. >>>>> >>>>> I had strong opinions that I think were a bit confused and wrong. >>>>> Sorry for that. I stated this position: >>>>> >>>>> - XYZState class is a handle to a mutable location >>>>> - its methods like isEmpty() or contents() should return immutable >>>>> future values (implicitly means their contents are semantically frozen >>>>> when >>>>> they are created) >>>>> - the fact that you created the future is a hint that all necessary >>>>> fetching/computation should be kicked off >>>>> - later forced with get() >>>>> - when it was designed, pure async style was not a viable option >>>>> >>>>> I see now that the actual position of some of its original designers >>>>> is: >>>>> >>>>> - XYZState class is a view on a mutable location >>>>> - its methods return new views on that mutable location >>>>> - calling readLater() is a hint that some fetching/computation should >>>>> be kicked off >>>>> - later read() will combine whatever readLater() did with additional >>>>> local info to give the current value >>>>> - async style not applicable nor desirable as per Beam's focus on >>>>> naive straight-line coding + autoscaling >>>>> >>>>> These are both internally consistent I think. In fact, I like the >>>>> second perspective better than the one I have been promoting. There are >>>>> some weaknesses: readLater() is pretty tightly coupled to a particular >>>>> implementation style, and futures are decades old so you can get good APIs >>>>> and performance without inventing anything. But I still like the >>>>> non-future >>>>> version a little better. >>>>> >>>>> Kenn >>>>> >>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <c...@google.com> wrote: >>>>> >>>>>> During the design of the Beam Python State API, we noticed some >>>>>> transactionality inconsistencies in the existing Beam Java State API >>>>>> (these >>>>>> are the unresolved bugs BEAM-2980 >>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975 >>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>). We are >>>>>> therefore having a discussion about this API which can have implications >>>>>> for its future development in all Beam languages: >>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b >>>>>> >>>>>> If you have an opinion on the possible design approaches, it would be >>>>>> very helpful to bring up in the doc or on this thread. Thanks! >>>>>> >>>>>> Best, >>>>>> Charles >>>>>> >>>>>