On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <[email protected]> wrote:
> Hi,
>
> I found the blocking problem with the state transfer this morning. It
> happens because of the reordering of a regular and an OOB message.
>
> Below is a simplification of what is happening for two nodes:
>
> A: total order broadcasts rebalance_start
>
> B: (incoming thread) delivers rebalance_start
> B: has no segments to request so the rebalance is done
> B: sends async request with rebalance_confirm (unicast #x)
> B: sends the rebalance_start response (unicast #x+1) (the response is a
> regular message)
>
> A: receives rebalance_start response (unicast #x+1)
> A: in UNICAST2, it detects the message is out-of-order and blocks the
> response in the sender window (i.e. the message #x is missing)
> A: receives the rebalance_confirm (unicast #x)
> A: delivers rebalance_confirm. Infinispan blocks this command until all
> the rebalance_start responses are received ==> this causes a deadlock!
> (because the response is blocked in the unicast layer)
>
> Question: can the request's response message always be sent as OOB? (I
> think the answer should be no...)
>

We could, if Bela adds the send(Message) method to the Response
interface... and personally I think it would be better to make all
responses OOB (as in JGroups 3.2.x). I don't have any data to back this up,
though...

> My suggestion: when I deliver a rebalance_confirm command (which is sent
> async), can I move it to a thread in async_thread_pool_executor?
>

I have a WIP fix for https://issues.jboss.org/browse/ISPN-2825, which should
stop blocking the REBALANCE_CONFIRM commands on the coordinator:
https://github.com/danberindei/infinispan/tree/t_2825_m

I haven't issued a PR yet because I'm still getting a failure in
ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP not
receiving an ACK from itself). I'll let you know when I find out...

> Weird thing: last night I tried more than 5 times in a row with UNICAST3
> and it never blocked. Does this mean a problem with UNICAST3, or was I
> just lucky?
>

Even though the REBALANCE_CONFIRM command is sent async, the message is
still OOB. I think UNICAST/2/3 should not block any regular message waiting
for the processing of an OOB message, as long as that message was received,
so maybe the problem is in UNICAST2?

> Any other suggestions?
>
> Cheers,
> Pedro
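
To make the sequence above concrete, here is a minimal sketch of the reordering on node A. It is a toy model, not JGroups or Infinispan code: the class ReorderDemo, the run() method and the 200 ms timeout (standing in for "blocks forever") are all made up for illustration, and it assumes that the window only releases the buffered #x+1 after the handler for #x has returned, which is the behaviour Pedro describes for UNICAST2.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ReorderDemo {

    /**
     * Simulates node A receiving #x+1 (regular response) before #x (OOB confirm).
     * Returns true if the confirm handler saw the response before its timeout.
     */
    static boolean run(boolean offloadConfirm) throws Exception {
        CountDownLatch responseDelivered = new CountDownLatch(1);
        // Hypothetical stand-in for a separate async thread pool.
        ExecutorService asyncPool = Executors.newSingleThreadExecutor();
        try {
            // Step 1: #x+1 arrives first. #x is missing, so the toy window
            // buffers the response instead of delivering it.
            Runnable bufferedResponse = responseDelivered::countDown;

            // Step 2: #x arrives. Its handler waits until the response has
            // been delivered; the timeout stands in for blocking forever.
            Callable<Boolean> confirmHandler =
                () -> responseDelivered.await(200, TimeUnit.MILLISECONDS);

            Future<Boolean> confirmResult;
            if (offloadConfirm) {
                // Hand the blocking work to another thread and return at once.
                confirmResult = asyncPool.submit(confirmHandler);
            } else {
                // Block the delivery thread itself.
                confirmResult = CompletableFuture.completedFuture(confirmHandler.call());
            }

            // Step 3 (modelling assumption): only once the handler for #x has
            // returned does the window advance and deliver the buffered #x+1.
            bufferedResponse.run();

            return confirmResult.get(1, TimeUnit.SECONDS);
        } finally {
            asyncPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline handler completed:    " + run(false));
        System.out.println("offloaded handler completed: " + run(true));
    }
}
```

In this model run(false) returns false, because the handler times out before the response can be delivered, while run(true) returns true, because the delivery thread is free to release #x+1. That matches the idea of not blocking the REBALANCE_CONFIRM command on the thread that delivers it, but it says nothing about where the real fix belongs.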
