If we go this way, we need key -> state map for each component so that the state data can be repartitioned.
On Fri, May 4, 2018 at 11:44 PM, Karthik Ramasamy <kart...@streaml.io> wrote: > Instead - if it references > > topology name + component name + key range > > will it be better? > > cheers > /karthik > > > On Fri, May 4, 2018 at 11:23 PM, Ning Wang <wangnin...@gmail.com> wrote: > > > Currently I think each Instance serializes the state object into a byte > > array and checkpoint manager saves the byte array into a file. The file > is > > referenced by topology name + component name + instance id. > > > > On Fri, May 4, 2018 at 11:10 PM, Karthik Ramasamy <kart...@streaml.io> > > wrote: > > > > > I am not sure I understand why the state is tied to an instance? > > > > > > cheers > > > /karthik > > > > > > On Fri, May 4, 2018 at 4:36 PM, Thomas Cooper <tom.n.coo...@gmail.com> > > > wrote: > > > > > > > Yeah, state recovery is a bit more difficult with Heron's > architecture. > > > In > > > > Storm, the task IDs are not just values used for routing they > actually > > > > equate to a task instance within the executor. An executor which > > > currently > > > > processes the keys 4-8 actually contains 5 task instances of the same > > > > component. So for each task, they just save its state attached to the > > > > single task ID and reassemble executors with the new task instances. > > > > > > > > We don't want or have to do that with Heron instances but we would > need > > > to > > > > have some way to have a state change tied to the task (or routing key > > if > > > we > > > > go to the key range idea). For something like a word count you might > > save > > > > counts using a nested map like: { routing key : {word : count }}. The > > > > routing key could be included in the Tuple instance. However, whether > > > this > > > > pattern would work for more generic state cases I don't know? > > > > > > > > Tom Cooper > > > > W: www.tomcooper.org.uk | Twitter: @tomncooper > > > > <https://twitter.com/tomncooper> > > > > > > > > > > > > On Fri, 4 May 2018 at 15:54, Neng Lu <freen...@gmail.com> wrote: > > > > > > > > > +1 for this idea. As long as the predefined key space is large > > enough, > > > it > > > > > should work for most of the cases. > > > > > > > > > > Based on my experience with topologies, I never saw one component > has > > > > more > > > > > than 1000 instances in a topology. > > > > > > > > > > For recovering states from an update, there will be some problems > > > though. > > > > > Since the states stored in heron are strongly connected with each > > > > instance, > > > > > we either need to have > > > > > some resolver does the state repartitioning or stores states with > the > > > key > > > > > instead of with each instance. > > > > > > > > > > > > > > > > > > > > On Fri, May 4, 2018 at 3:01 PM, Karthik Ramasamy < > > kramas...@gmail.com> > > > > > wrote: > > > > > > > > > > > Thanks for sharing. I like the Storm approach > > > > > > > > > > > > - keeps the implementation simpler > > > > > > - state is deterministic across restarts > > > > > > - makes it easy to reason and debug > > > > > > > > > > > > The hard limit is not a problem at all since most of the > topologies > > > > will > > > > > > be never that big. > > > > > > If you can handle Twitter topologies cleanly, it is more that > > > > sufficient > > > > > I > > > > > > believe. > > > > > > > > > > > > cheers > > > > > > /karthik > > > > > > > > > > > > > On May 4, 2018, at 2:31 PM, Thomas Cooper < > > tom.n.coo...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > A while ago I emailed about the issue of how fields (key) > grouped > > > > > routing > > > > > > > in Heron was not consistent across an update and how this makes > > > > > > preserving > > > > > > > state across an update very difficult and also makes it > > > > > > > difficult/impossible to analyse or predict tuple flows through > a > > > > > > > current/proposed topology physical plan. > > > > > > > > > > > > > > I suggested adopting Storms approach of pre-defining a routing > > key > > > > > > > space for each component (eg 0-999), so that instead of an > > instance > > > > > > having > > > > > > > a single task id that gets reset at every update (eg 10) it > has a > > > > range > > > > > > of > > > > > > > id's (eg 10-16) that changes depending on the parallelism of > the > > > > > > component. > > > > > > > This has the advantage that a key will always hash to the same > > task > > > > ID > > > > > > for > > > > > > > the lifetime of the topology. Meaning recovering state for an > > > > instance > > > > > > > after a crash or update is just a case of pulling the state > > linked > > > to > > > > > the > > > > > > > keys in its task ID range. > > > > > > > > > > > > > > I know the above proposal has issues, not least of all placing > a > > > hard > > > > > > upper > > > > > > > limit on the scale out of a component, and that some > alternative > > > > ideas > > > > > > are > > > > > > > being floated for solving the stateful update issue. However, I > > > just > > > > > > wanted > > > > > > > to throw some more weight behind the Storm approach. There was > a > > > > recent > > > > > > > paper about high-performance network load balancing > > > > > > > <https://blog.acolyer.org/2018/05/03/stateless- > > > > > > datacenter-load-balancing-with-beamer/>that > > > > > > > describes an approach using a fixed key space similar to > Storm's > > > (see > > > > > the > > > > > > > section called Stable Hashing - they assign a range 100x the > > > expected > > > > > > > connection pool size - which we could do with heron to prevent > > ever > > > > > > hitting > > > > > > > the upper scaling limit). Also, this new load balancer, Beamer, > > > > claims > > > > > to > > > > > > > be twice as fast as Google's Maglev > > > > > > > <https://blog.acolyer.org/2016/03/21/maglev-a-fast-and- > > > > > > reliable-software-network-load-balancer/> > > > > > > > which again uses a pre-defined keyspace and ID ranges to create > > > > look-up > > > > > > > tables deterministically. > > > > > > > > > > > > > > I know a load balancer is a different beast to a stream > grouping > > > but > > > > > > there > > > > > > > are some interesting ideas in those papers (The links point to > > > > summary > > > > > > blog > > > > > > > posts so you don't have to read the whole paper). > > > > > > > > > > > > > > Anyway, I just thought I would those papers out there and see > > what > > > > > people > > > > > > > think. > > > > > > > > > > > > > > Tom Cooper > > > > > > > W: www.tomcooper.org.uk | Twitter: @tomncooper > > > > > > > <https://twitter.com/tomncooper> > > > > > > > > > > > > > > > > > > > > > > > > > > >