Hi @Dan, see my comment inline.
Cheers,
Pedro

On 3/15/12 1:36 PM, Dan Berindei wrote:
> On Thu, Mar 15, 2012 at 10:31 AM, Bela Ban <[email protected]> wrote:
>>
>> On 3/12/12 3:03 PM, Dan Berindei wrote:
>>> On Sat, Mar 10, 2012 at 7:07 PM, Bela Ban <[email protected]> wrote:
>>>> Can you confirm that my understanding of how DIST works is correct?
>>>>
>>>> #1 Non-transactional modifications
>>>> - E.g. a PUT
>>>> - The PUT is sent to the primary owner (e.g. B) and all backup owners
>>>>   (e.g. C) and is applied immediately, with only local lock
>>>>   acquisition (lock-put-unlock)
>>> This is correct; the key lock is acquired in parallel on all the
>>> owners, so two concurrent PUTs could result in different owners
>>> having different values.
>>
>> So the section that issues the (say) 2 unicast PUTs isn't protected by
>> a lock? I'm not suggesting that waiting for the results from the 2
>> futures should be synchronized, only the *sending* of the 2 unicasts.
>> Then, if thread 1 acquired that lock first and thread 2 second, thread
>> 1's modifications would be delivered *before* thread 2's
>> modifications.
>> Of course, this would order only messages from the same sender.
>>
> We do acquire the key lock on the originator as well; I just missed it
> when I read the code.
>
>>>> (or is the PUT only sent to B, which in turn then updates C?)
>>>>
>>>> #2 Transactional modifications
>>>> - The modifications involve a bunch of keys
>>>> - When the TX commits:
>>>> - A PREPARE message with the relevant keys is sent to all primary
>>>>   owners P (to the backup owners as well?)
>>> The PREPARE is sent to the backup owners as well.
>>> The complete list of modifications is sent to all the owners, primary
>>> or backup.
>>
>> OK, we discussed this on IRC, thanks for the clarification.
>>
>> If we touch a lot of keys, then sending *all* of the keys to all
>> owners may be sub-optimal; as an optimization, we may want to send
>> only the keys to the nodes which need to store them. This would make
>> the PREPAREs potentially much smaller.
>>
> Agreed, but it's a non-trivial optimization. For instance, if there is
> a view change between the prepare and the commit, the recipient of the
> commit may not have all the modifications in the list.
> With recovery enabled, the originator would also have to get the list
> of modifications from all the targets and combine it into a single
> PREPARE command, which is much more complex than what happens now.
>
>>>> If so, then I can assume that a transactional modification touching
>>>> a number of keys will almost always touch *all* nodes? Example:
>>>> - We have 10 nodes
>>>> - numOwners = 2
>>>> - If we have a good consistent hash, I can assume that I have to
>>>>   modify 5 different keys (10 / 2) on average in a TX to touch *all*
>>>>   nodes in the cluster with the PREPARE/COMMIT phase, correct?
>>>>
>>> You have to modify *at least* 5 different keys to touch all the
>>> nodes. Some keys will share the primary owner, but also one key's
>>> primary owner can be another's backup owner - so the chances of
>>> touching all nodes with only 5 keys are very low.
>>>
>>> It turns out the probability of a tx touching all the nodes depends
>>> on whether we have virtual nodes enabled or not - with lots of
>>> virtual nodes all the {(i, j) where 0 <= i != j < numNodes}
>>> combinations are valid, whereas without virtual nodes only numNodes
>>> combinations are allowed - {(i, (i+1) % numNodes) where
>>> 0 <= i < numNodes}.
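
This model is easy to simulate, by the way. A minimal sketch (my own
simplification, not Dan's actual IPython code: with virtual nodes I draw
each key's (primary, backup) pair as a uniformly random ordered pair of
distinct nodes, so the figures won't exactly match the real consistent
hash):

    import random

    def owners(num_nodes, virtual_nodes=True):
        # With virtual nodes: any ordered pair of distinct nodes is a
        # valid (primary, backup) combination.
        # Without: only the numNodes pairs (i, (i+1) % numNodes).
        if virtual_nodes:
            return tuple(random.sample(range(num_nodes), 2))
        i = random.randrange(num_nodes)
        return (i, (i + 1) % num_nodes)

    def prob_touch_all(num_nodes, keys_per_tx, virtual_nodes=True,
                       trials=10000):
        # Monte Carlo estimate of the chance that a tx writing
        # keys_per_tx distinct keys sends its PREPARE to every node.
        hits = 0
        for _ in range(trials):
            touched = set()
            for _ in range(keys_per_tx):
                touched.update(owners(num_nodes, virtual_nodes))
            hits += len(touched) == num_nodes
        return hits / float(trials)

For example, prob_touch_all(10, 5) comes out well below 1%, which agrees
with Dan's point that 5 keys on 10 nodes almost never touch everybody.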
>>>
>>> I've run some simulations with IPython assuming virtual nodes, and
>>> these are the numbers of keys per tx required to touch all the nodes
>>> 50% of the time, for a few representative cluster sizes:
>>> 4: 4, 8: 10, 16: 25, 32: 61
>>
>> So we have to touch roughly clusterSize * 2 keys to send modification
>> requests to all nodes.
>>
> Actually the number of required keys grows a little faster than that.
> The chances of 2 * clusterSize keys touching all the nodes are:
> 2: 1.0,
> 4: 0.985,
> 8: 0.91,
> 16: 0.796,
> 32: 0.585,
> 64: 0.327,
> 128: 0.092
>
>>> These are the numbers of keys required to touch numNodes/2 nodes 50%
>>> of the time:
>>> 4: 1, 8: 2, 16: 5, 32: 11
>>
>> OK
>>
>>> And these are the numbers of keys required to touch numNodes/2 nodes
>>> 90% of the time:
>>> 4: 1, 8: 3, 16: 7, 32: 13
>>>
>>>> If my last statement is correct, is it safe to assume that with DIST
>>>> and transactional modifications, I will have a lot of TX contention
>>>> / collisions?
>>>>
>>> Not sure what you mean by a lot of TX contention - lock contention
>>> should only depend on the dataset size, unless we use lock striping,
>>> in which case it depends on the configured concurrency level.
>>
>> I meant TX rollbacks due to overlapping locks at different nodes, the
>> stuff Pedro wrote about in his paper on total order.
>>
> Hmm, I thought that because we sort the keys before locking, it
> shouldn't be possible to have deadlocks between prepare commands. I was
> assuming that the Tx aborts in Pedro's tests were due to write skew
> check failures, but I just read his message again and he mentions that
> the write skew check is disabled.
> I must be missing something...

I think a transaction aborts if a scenario like this occurs:

4 nodes, N1 to N4. N2 is the primary owner of KeyA and N3 is the primary
owner of KeyB. N1 is executing transaction Tx1, which writes to KeyA and
KeyB; N4 is executing transaction Tx2, which also writes to KeyA and
KeyB. Both transactions try to prepare at the same time.

This scenario can occur (I think):

N2 -> deliver(Tx1), lock(KeyA), deliver(Tx2), tryLock(KeyA)
      // Tx2 is blocked until the lock on KeyA is released
N3 -> deliver(Tx2), lock(KeyB), deliver(Tx1), tryLock(KeyB)
      // Tx1 is blocked until the lock on KeyB is released

Eventually Tx1 or Tx2 (or both) will be aborted by a timeout.

Is this behavior correct? Am I missing something? (There is a small
sketch of this interleaving below, after the quoted text.)

>>> Network traffic would probably increase quadratically with the
>>> number of keys/tx because we do send all the modifications to all
>>> the owners, but we could fix that by either transmitting only the
>>> relevant bits to each owner
>>
>> IMO that's an optimization that should be done regardless, see my
>> comment above.
>>
> It does complicate other parts of the code, so I don't think it's such
> a clear win.
>
>>> or by switching to multicast after the number of targets hits a
>>> threshold.
>>
>> The problem with switching between unicasts and multicasts is that
>> there is no ordering between unicast and multicast messages. So if you
>> send an anycast M1 (2 unicasts) to B and C, followed by a multicast
>> M2, either M1 or M2 could be delivered first: B could deliver M1
>> followed by M2, while C delivers M2 followed by M1.
>>
>> If the above scenario is not a problem because you acquire locks
>> anyway, we could do it though... might be a nice optimization!
>>
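
Here is the promised sketch of the lock interleaving above: two plain
Python locks stand in for the per-key locks on N2 and N3, and the
opposite acquisition order stands in for the opposite delivery order of
the two prepares (the names and the 1-second timeout are made up for
illustration, this is not Infinispan's lock manager):

    import threading

    key_a = threading.Lock()  # KeyA's lock, held on N2
    key_b = threading.Lock()  # KeyB's lock, held on N3

    def prepare(tx, first, second):
        # Lock the key whose prepare was delivered first at "this" node,
        # then try the other key with a timeout, as in tryLock().
        with first:
            if second.acquire(timeout=1.0):
                second.release()
                print(tx, "prepared")
            else:
                print(tx, "aborted by lock timeout")  # rollback

    # Tx1's prepare arrives first on N2 (KeyA), Tx2's first on N3 (KeyB)
    t1 = threading.Thread(target=prepare, args=("Tx1", key_a, key_b))
    t2 = threading.Thread(target=prepare, args=("Tx2", key_b, key_a))
    t1.start(); t2.start()
    t1.join(); t2.join()

Most runs print two "aborted by lock timeout" lines: each tx holds the
lock the other one needs, so both time out, just as in the N2/N3
scenario; occasionally one tx wins the race and prepares.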
>>>> If this is correct, this would IMO lend even more importance to the
>>>> work done by the Cloud-TM team, replacing 2PC with total order.
>>>> Also, if we touch almost all nodes, would it make sense to use
>>>> SEQUENCER for *all* updates? Would this obviate the need for TOM
>>>> (total order for partial replication)?
>>>>
>>> I don't know enough about Cloud-TM to comment here, but it appears
>>> SEQUENCER requires the coordinator to receive forward requests from
>>> all the nodes and send nothing back on the unicast channel.
>>
>> Yes, this is basically the same problem: with SEQUENCER, there is no
>> ordering between unicasts and multicasts, either.
>>
>>> Sending the whole tx as a multicast would certainly be more efficient
>>> than what we do now with lots of targets.
>>
>> Just as confirmation of my understanding:
>> - We have {A,B,C,D,E}
>> - A TX modifies keys K1 (maps to A:B), K2 (maps to B:C), K5 (maps to
>>   E:A)
>>
>> Do we send 3 anycasts A:B, B:C and E:A, or do we send 1 anycast
>> A:B:C:E? I think it's the latter, but please confirm...
>>
> Nope, we haven't switched to anycasts yet.
> We do serialize the command beforehand, so we don't have triple
> serialization overhead, but we do need an extra FutureCollator object.
>
> My assumption is that the multicast itself is more efficient, because
> from the originator to the switch it's just one packet. I wasn't
> thinking of the extra state Infinispan and JGroups have to keep for
> the multiple unicast requests.
>
> Cheers
> Dan
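
As a footnote to Bela's K1/K2/K5 example, a toy comparison of the two
options (hypothetical, not the Infinispan or JGroups API; it just counts
unicasts, one per anycast target):

    owners_by_key = {"K1": ["A", "B"], "K2": ["B", "C"], "K5": ["E", "A"]}

    # Option 1: one anycast per key -> one unicast per owner per key
    per_key_unicasts = sum(len(owners) for owners in owners_by_key.values())

    # Option 2: a single anycast to the union of all owners
    union = set().union(*owners_by_key.values())

    print(per_key_unicasts)           # 6 unicasts in three anycasts
    print(sorted(union), len(union))  # ['A', 'B', 'C', 'E'] -> 4 unicasts

A true multicast would be a single packet from the originator to the
switch regardless of the number of targets, which is Dan's point above.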
