Sanne and I resumed the meeting later yesterday afternoon, but we basically just rehashed the stuff that we'd been discussing before lunch. Logs here:
(07:10:10 PM) jbott: Meeting ended Tue Jun 12 16:09:55 2012 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
(07:10:10 PM) jbott: Minutes: http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-15.26.html
(07:10:10 PM) jbott: Minutes (text): http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-15.26.txt
(07:10:10 PM) jbott: Log: http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-15.26.log.html

The main conclusion was that the total number of virtual nodes/hash segments will be fixed per cluster, not per node. Kind of like the old AbstractWheelConsistentHash.HASH_SPACE, only configurable. A physical node will have a variable number of vnodes/segments over its lifetime.

We also decided to add a pull component to our state transfer. The current NBST design requires all the nodes to push state to a joiner more or less at the same time, which results in lots of congestion at the network layer and sometimes even in the joiner being excluded from the cluster. We have decided that a node will not start pushing data as soon as it receives the PREPARE_VIEW command from the coordinator, but instead it will wait for a START_PUSH command from the receiver. The receiver will only ask one previous owner at a time, thus eliminating the congestion.

We've had a lot of back-and-forth discussions about whether the CH should be "non-deterministic". We agreed in the end (I think) that it's fine if the creation of the CH is not based solely on the current members list, and depends on the previous CH as well. This is quite important: I think it would be hard to find an algorithm based only on the members list that doesn't change ownership for a lot of nodes in case of a leave (even if we use the previous members list as well): see https://issues.jboss.org/browse/ISPN-1275.

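To make the "CH depends on the previous CH" point concrete, here is a minimal sketch in Python (not Infinispan code) of how a leave could be handled with a fixed number of segments. The function name `rebalance`, the list-of-owner-lists representation and the least-loaded tie-breaking rule are all assumptions made for illustration, not part of the design discussed in the meeting. The result is still deterministic, since it depends only on the previous CH and the new members list.

```python
from collections import Counter


def rebalance(prev_owners, members, num_owners):
    """Build new per-segment owner lists from the previous ones.

    prev_owners: one list of owners per segment, primary owner first.
    members:     the current cluster members, in view order.
    Surviving owners keep their segments; only the gaps left by
    departed nodes are filled, using the least-loaded member.
    (Hypothetical helper, for illustration only.)
    """
    members = list(members)
    num_owners = min(num_owners, len(members))

    # Step 1: drop owners that have left, keeping the others in order.
    new_owners = [[n for n in owners if n in members] for owners in prev_owners]

    # Count how many segments each surviving member already owns.
    load = Counter({m: 0 for m in members})
    for owners in new_owners:
        load.update(owners)

    # Step 2: fill each segment back up to num_owners. Ties are broken
    # by position in the members list, so every node computes the same CH.
    for owners in new_owners:
        while len(owners) < num_owners:
            candidate = min(
                (m for m in members if m not in owners),
                key=lambda m: (load[m], members.index(m)),
            )
            owners.append(candidate)
            load[candidate] += 1
    return new_owners


# Four segments, numOwners = 2, and node "D" leaves the cluster.
previous = [["A", "B"], ["B", "C"], ["C", "D"], ["D", "A"]]
current = rebalance(previous, ["A", "B", "C"], 2)
print(current)
```

Only the segments that "D" owned get a new owner; the others are untouched. Note that this sketch never moves segments to a joiner, so a real implementation would also need a balancing step for joins.
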
I had an idea (that I'm pretty sure I didn't explain properly in the chat) that we could avoid state transfer blocking everything while receiving the transaction table from a previous owner by splitting the state transfer in two:

* In the first phase, we'd pick the new backup owners for each segment, and we'd transfer all the state to them (entries, transaction table, etc.)
* In the second phase, we'd pick a new primary owner for each segment, but the primary owner can only be one of the existing backup owners. Since the data has already been transferred, we can now also remove the extra owners.

During the first phase, a segment could have more than numOwners owners, and commands would reach both the new owners and the old owners. We will need to handle commit commands for transactions that the new owner doesn't have yet in its transaction table, but we would not need to block prepare commands (like the current NBST design does). During the second phase, the new primary owner already has the transaction table, so we don't need a blocking phase either.

I didn't explain this properly in the chat because I was certain it would only make sense if the coordinator initiated state transfer one node at a time, making it non-deterministic. But I think if we allow the CH creation algorithm to use the previous CH, we can deterministically decide if the backup owners are properly balanced (if not, we need to start phase 1) and if the primary owners are properly balanced (if not, we need to start phase 2).

There is something else that I've been thinking about since yesterday that might improve performance and even simplify the state transfer at the cost of determinism. When state transfer fails (usually because a node has died, but not necessarily), the coordinator could ask each node how far it got with the state transfer in progress (how many segments they got, from which owners, etc).
The coordinator would then create a new "base CH" based on the actually transferred data instead of the actual start CH or the "pending CH", or even the whole chain/tree of CHs, none of which reflect how data is effectively stored in the cluster at that moment. Because this base CH would reflect the actual owners of each segment, there would be less data moving around in the new state transfer and we wouldn't need to keep a chain/tree of previous owner lists either.

I'm going to take a stab at implementing a new CH with a fixed number of vnodes that can take an existing CH as input and change owners as little as possible. Then I'm going to try to implement the balanced backup owners/balanced primary owners check as well, just to see if it's really possible.

I'm not going to modify the design document just yet, I need to see first if it does work and what you guys think about it...

Cheers
Dan

On Tue, Jun 12, 2012 at 4:02 PM, Manik Surtani <[email protected]> wrote:
> Meeting minutes from part 1. Had to break for lunch. :)
>
> Meeting ended Tue Jun 12 13:00:43 2012 UTC. Information about MeetBot
> at http://wiki.debian.org/MeetBot . (v 0.1.4)
> 14:01
> Minutes:
> http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-09.58.html
> 14:01
> Minutes (text):
> http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-09.58.txt
> 14:01
> Log:
> http://transcripts.jboss.org/meeting/irc.freenode.org/infinispan/2012/infinispan.2012-06-12-09.58.log.html
>
> --
> Manik Surtani
> [email protected]
> twitter.com/maniksurtani
>
> Project Lead, Infinispan
> http://www.infinispan.org
>
> Platform Architect, JBoss Data Grid
> http://www.redhat.com/promo/dg6beta
>
> _______________________________________________
> infinispan-dev mailing list
> [email protected]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
