On 18 Apr 2013, at 06:44, Dan Berindei <[email protected]> wrote:
> On Wed, Apr 17, 2013 at 5:53 PM, Manik Surtani <[email protected]> wrote:
> > On 17 Apr 2013, at 08:23, Dan Berindei <[email protected]> wrote:
> >> I like the idea of always clearing the state in members of the minority
> >> partition(s), but one problem with that is that there may be some keys that
> >> only had owners in the minority partition(s). If we wiped the state of the
> >> minority partition members, those keys would be lost.
> >
> > Right, this is my concern with such a wipe as well.
> >
> >> Of course, you could argue that the cluster already lost those keys when we
> >> allowed the majority partition to continue working without having those
> >> keys... We could also rely on the topology information, and say that we only
> >> support partitioning when numOwners >= numSites (or numRacks, if there is
> >> only one site, or numMachines, if there is a single rack).
> >
> > This is only true for an embedded app. For an app communicating with the
> > cluster over Hot Rod, this isn't the case, as it could directly read from the
> > minority partition.
>
> For that to happen, the client would have to be able to keep two (or more)
> active consistent hashes at the same time. I think, at least in the first
> phase, the servers in a minority partition should send a "look somewhere
> else" response to any request from the client, so that it installs the
> topology update of the majority partition and not the topology of one of the
> minority partitions.

That might work with Hot Rod, but not with the REST or memcached endpoints.

> >> One other option is to perform a more complicated post-merge state transfer,
> >> in which each partition sends all the data it has to all the other
> >> partitions, and on the receiving end each node has a "conflict resolution"
> >> component that can merge two values. That is definitely more complicated
> >> than just going with a primary partition, though.
> >
> > That sounds massively expensive. I think the right solution at this point is
> > entry versioning using vector clocks, where the vector clocks are exchanged
> > and compared during a merge. Not the entire dataset.
>
> True, it would be very expensive. I think for many applications just
> selecting a winner should be fine, though, so it might be worth implementing
> this algorithm with the versioning support we already have as a POC.

I presume versioning (based on vector clocks) is now in master, as part of Pedro's work on total order?

> >> One final point... when a node comes back online and it has a local cache
> >> store, it is very much as if we had a merge view. The current approach is to
> >> join as if the node didn't have any data, then delete everything from the
> >> cache store that is not mapped to the node in the consistent hash. Obviously
> >> that can lead to consistency problems, just like our current merge
> >> algorithm. It would be nice if we could handle both these cases the same way.
> >
> > +1
>
> I've been thinking about this some more... the problem with the local cache
> store is that nodes don't necessarily start in the same order in which they
> were shut down. So you might have enough nodes for a cluster to consider
> itself "available", but only a slight overlap with the cluster as it looked
> the last time it was "available" - so you would have stale data.
>
> We might be able to save the topology on shutdown and block startup until the
> same nodes that were in the last "available" partition are all up, but it all
> sounds a bit fragile.

Yeah, those nodes may never come up again.
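To make the topology condition quoted above a bit more concrete, here is a rough sketch of the check as I read Dan's suggestion - the class and method names are made up for illustration and are not existing Infinispan API:

```java
// Sketch of the "numOwners >= numSites (or numRacks, or numMachines)" rule:
// partition handling is only considered safe when each entry has enough owners
// to span the widest topology level that actually has more than one member.
// Hypothetical names, not actual Infinispan classes.
public final class PartitionSupportCheck {

    private PartitionSupportCheck() {
    }

    public static boolean supportsPartitioning(int numOwners, int numSites,
                                                int numRacks, int numMachines) {
        int topologyUnits;
        if (numSites > 1) {
            topologyUnits = numSites;       // multiple sites: owners must span sites
        } else if (numRacks > 1) {
            topologyUnits = numRacks;       // single site: owners must span racks
        } else {
            topologyUnits = numMachines;    // single rack: owners must span machines
        }
        return numOwners >= topologyUnits;
    }
}
```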
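And on the versioning side, the merge-time comparison we keep coming back to boils down to something like the sketch below - a minimal, hypothetical vector clock written purely to illustrate how a "conflict resolution" component would pick a winner or flag a conflict; it is not the versioning API we actually ship:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical vector clock, only to illustrate the merge-time comparison;
// not Infinispan's versioning implementation.
public class VectorClock {

    public enum Ordering { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final Map<String, Long> counters = new HashMap<String, Long>();

    /** Record a write performed by the given node. */
    public void increment(String nodeId) {
        Long current = counters.get(nodeId);
        counters.put(nodeId, current == null ? 1L : current + 1);
    }

    /**
     * Compare two clocks counter by counter. If every counter in this clock is
     * <= the other's (and at least one is strictly smaller), this write happened
     * BEFORE the other; the symmetric case is AFTER; a mix of both means the
     * writes were CONCURRENT and need an explicit conflict-resolution decision.
     */
    public Ordering compare(VectorClock other) {
        boolean lessSomewhere = false;
        boolean greaterSomewhere = false;
        Set<String> nodes = new HashSet<String>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long mine = counters.containsKey(node) ? counters.get(node) : 0L;
            long theirs = other.counters.containsKey(node) ? other.counters.get(node) : 0L;
            if (mine < theirs) lessSomewhere = true;
            if (mine > theirs) greaterSomewhere = true;
        }
        if (lessSomewhere && greaterSomewhere) return Ordering.CONCURRENT;
        if (lessSomewhere) return Ordering.BEFORE;
        if (greaterSomewhere) return Ordering.AFTER;
        return Ordering.EQUAL;
    }
}
```

Exchanging and comparing just these clocks during a merge is what keeps the cost bounded, as opposed to shipping the entire dataset between partitions.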
- M

--
Manik Surtani
[email protected]
twitter.com/maniksurtani

Platform Architect, JBoss Data Grid
http://red.ht/data-grid
