On 18 Apr 2013, at 06:44, Dan Berindei <[email protected]> wrote:

> 
> 
> 
> On Wed, Apr 17, 2013 at 5:53 PM, Manik Surtani <[email protected]> wrote:
> 
> On 17 Apr 2013, at 08:23, Dan Berindei <[email protected]> wrote:
> 
>> I like the idea of always clearing the state in members of the minority 
>> partition(s), but one problem with that is that there may be some keys that 
>> only had owners in the minority partition(s). If we wiped the state of the 
>> minority partition members, those keys would be lost.
> 
> Right, this is my concern with such a wipe as well.
> 
>> Of course, you could argue that the cluster already lost those keys when we 
>> allowed the majority partition to continue working without having those 
>> keys... We could also rely on the topology information, and say that we only 
>> support partitioning when numOwners >= numSites (or numRacks, if there is 
>> only one site, or numMachines, if there is a single rack).
> 
> This is only true for an embedded app.  For an app communicating with the 
> cluster over Hot Rod, this isn't the case as it could directly read from the 
> minority partition.
> 
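As a reference point, the constraint suggested above boils down to a simple 
check against the topology. A hypothetical helper - none of this is existing 
API, and the counts would have to come from the cache topology:

    final class PartitionSupportCheck {
       // Only wipe a minority partition if every key is guaranteed at least
       // one owner per site (or per rack / per machine, depending on what the
       // topology actually spans).
       static boolean partitioningSupported(int numOwners, int numSites,
                                            int numRacks, int numMachines) {
          if (numSites > 1) return numOwners >= numSites;
          if (numRacks > 1) return numOwners >= numRacks;
          return numOwners >= numMachines;
       }
    }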
> 
> For that to happen, the client would have to be able to keep two (or more) 
> active consistent hashes at the same time. I think, at least in the first 
> phase, the servers in a minority partition should send a "look somewhere 
> else" response to any request from the client, so that it installs the 
> topology update of the majority partition and not the topology of one of the 
> minority partitions.

That might work with the Hot Rod endpoint, but not with the REST or memcached 
endpoints.
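
For Hot Rod that would look roughly like the sketch below: the server refuses 
to serve requests while it sits in a minority partition and hands back the id 
of the last stable topology, so the client re-routes to the majority. All of 
the types here are made up for illustration; this is not the actual Hot Rod 
server code.

    // Illustration only: placeholder types, not real Infinispan/Hot Rod classes.
    enum PartitionStatus { AVAILABLE, DEGRADED_MINORITY }

    interface TopologyProvider {
       PartitionStatus localPartitionStatus();
       int lastStableTopologyId();   // last topology in which we were "available"
    }

    final class MinorityAwareRequestHandler {
       private final TopologyProvider topology;

       MinorityAwareRequestHandler(TopologyProvider topology) {
          this.topology = topology;
       }

       // A request can be served locally only while this node is in the
       // majority ("available") partition.
       boolean canServeLocally() {
          return topology.localPartitionStatus() == PartitionStatus.AVAILABLE;
       }

       // Otherwise the Hot Rod response would carry this topology id, steering
       // the client back to the majority partition. REST and memcached clients
       // have no such topology channel, which is the limitation above.
       int redirectTopologyId() {
          return topology.lastStableTopologyId();
       }
    }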

> 
>> One other option is to perform a more complicated post-merge state transfer, 
>> in which each partition sends all the data it has to all the other 
>> partitions, and on the receiving end each node has a "conflict resolution" 
>> component that can merge two values. That is definitely more complicated 
>> than just going with a primary partition, though.
> 
> That sounds massively expensive.  I think the right solution at this point is 
> entry versioning using vector clocks, where only the vector clocks - not the 
> entire dataset - are exchanged and compared during a merge.
> 
> 
> True, it would be very expensive. I think for many applications just 
> selecting a winner should be fine, though, so it might be worth implementing 
> this algorithm as a POC with the versioning support we already have.

I presume versioning (based on vector clocks) is now in master, as part of 
Pedro's work on total order?
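
Either way, a minimal sketch of the merge-time comparison I have in mind - 
purely illustrative, not the versioning API that is (or may be) in master:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of per-entry vector clocks compared during a merge. Node ids are
    // plain strings here; a real implementation would use cluster addresses.
    final class VectorClock {
       enum Ordering { BEFORE, AFTER, EQUAL, CONCURRENT }

       private final Map<String, Long> counters = new HashMap<String, Long>();

       // Bump this node's counter on every local update of the entry.
       void increment(String node) {
          counters.put(node, value(counters, node) + 1);
       }

       // Partial order over clocks; CONCURRENT means the entry diverged in
       // different partitions and a conflict-resolution policy must pick a winner.
       Ordering compare(VectorClock other) {
          boolean less = false, greater = false;
          Set<String> allNodes = new HashSet<String>(counters.keySet());
          allNodes.addAll(other.counters.keySet());
          for (String node : allNodes) {
             long a = value(counters, node);
             long b = value(other.counters, node);
             if (a < b) less = true;
             if (a > b) greater = true;
          }
          if (less && greater) return Ordering.CONCURRENT;
          if (less) return Ordering.BEFORE;
          if (greater) return Ordering.AFTER;
          return Ordering.EQUAL;
       }

       private static long value(Map<String, Long> m, String node) {
          Long v = m.get(node);
          return v == null ? 0L : v;
       }
    }

During a merge only these clocks would be exchanged; a CONCURRENT result is 
where Dan's conflict-resolution component (or a simple "pick a winner" policy) 
would kick in.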

> 
>> One final point... when a node comes back online and it has a local cache 
>> store, it is very much as if we had a merge view. The current approach is to 
>> join as if the node didn't have any data, then delete everything from the 
>> cache store that is not mapped to the node in the consistent hash. Obviously 
>> that can lead to consistency problems, just like our current merge 
>> algorithm. It would be nice if we could handle both these cases the same way.
> 
> +1
>  
> 
> I've been thinking about this some more... the problem with the local cache 
> store is that nodes don't necessarily start in the same order in which they 
> were shut down. So you might have enough nodes for the cluster to consider 
> itself "available", but with only a slight overlap with the cluster as it 
> looked the last time it was "available" - so you would have stale data. 
> 
> We might be able to save the topology on shutdown and block startup until the 
> same nodes that were in the last "available" partition are all up, but it all 
> sounds a bit fragile.

Yeah, those nodes may never come up again.
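
For completeness, the purge-on-rejoin step described a few paragraphs up 
(delete whatever the new consistent hash no longer maps to this node) is 
roughly the following - ConsistentHash and CacheStore here are simplified 
stand-ins, not the real SPI:

    import java.util.Set;

    // Simplified stand-ins for illustration.
    interface ConsistentHash {
       boolean isKeyLocalToNode(String nodeAddress, Object key);
    }

    interface CacheStore {
       Set<Object> loadAllKeys();
       void delete(Object key);
    }

    final class RejoinPurgeTask {
       private final String localAddress;
       private final ConsistentHash currentCh;
       private final CacheStore store;

       RejoinPurgeTask(String localAddress, ConsistentHash currentCh, CacheStore store) {
          this.localAddress = localAddress;
          this.currentCh = currentCh;
          this.store = store;
       }

       // After the restarted node installs the current consistent hash, drop
       // every persisted entry that is no longer mapped to it. This is the
       // step that can silently discard the only surviving copy of a key.
       void purgeStaleEntries() {
          for (Object key : store.loadAllKeys()) {
             if (!currentCh.isKeyLocalToNode(localAddress, key)) {
                store.delete(key);
             }
          }
       }
    }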

- M

--
Manik Surtani
[email protected]
twitter.com/maniksurtani

Platform Architect, JBoss Data Grid
http://red.ht/data-grid

_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
