On 02/26/2013 04:31 PM, Bela Ban wrote:
On 2/26/13 5:14 PM, Pedro Ruivo wrote:
So, in this case, the regular message will block until the OOB message is
delivered.
No, the regular message should get delivered as soon as the OOB message
has been *received* (not *delivered*). Unless there are previous regular
messages from the same sender which are delivered in the same thread,
and one of them is blocked in application code...
In attachment is part of the log. I only know that the response is
disappearing between UNICAST2 and the ISPN unmarshaller.
could you please take a look?
the response is being sent and received and I don't understand why ISPN
is not receive it
Thanks
Pedro
however, the OOB message is being block in the application
until the regular message is delivered. And there is no way to pick the
regular message from the window list while the OOB is blocked, right?
(assuming no more incoming messages)
This actually should happen, as they're delivered by different threads !
so, if everybody agrees, if I move the OOB message to another thread,
everything should work fine...
On 02/26/2013 03:50 PM, Bela Ban wrote:
On 2/26/13 4:15 PM, Dan Berindei wrote:
On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <[email protected]
<mailto:[email protected]>> wrote:
hi,
I found the blocking problem with the state transfer this morning.
It happens because of the reordering of a regular and OOB message.
Below, is a simplification of what is happening for two nodes
A: total order broadcasts rebalance_start
B: (incoming thread) delivers rebalance_start
B: has no segments to request so the rebalance is done
B: sends async request with rebalance_confirm (unicast #x)
B: sends the rebalance_start response (unicast #x+1) (the response
is a regular message)
A: receives rebalance_start response (unicast #x+1)
A: in UNICAST2, it detects the message is out-of-order and blocks
the response in the sender window (i.e. the message #x is missing)
A: receives the rebalance_confirm (unicast #x)
A: delivers rebalance_confirm. Infinispan blocks this command
until all the rebalance_start responses are received ==> this
originates a deadlock! (because the response is blocked in unicast
layer)
Question: can the request's response message be sent always as
OOB? (I think the answer should be no...)
We could, if Bela adds the send(Message) method to the Response
interface...
I created a JIRA yesterday: https://issues.jboss.org/browse/JGRP-1602.
I'm wondering though if you *really* need it, as making all responses
OOB is a bad idea IMO, see below...
and personally I think it would be better to make all responses OOB
(as in JGroups 3.2.x). I don't have any data to back this up, though...
Intuitively, I think indiscriminatingly marking all responses as OOB
is bad, especially in the light of the async invocation API which will
make all messages non-blocking, at least in the OOB or reg thread pools.
The code in 3.3 *does* actually copy the flags of the request into the
response, so if the request is async (OOB), so will the response be.
For async RPCs (regular messages), you're not getting any response
anyway, so no worries here...
My suggestion: when I deliver a rebalance_confirm command (that it
is send async), can I move it to a thread in
async_thread_pool_executor?
I have WIP fix for https://issues.jboss.org/browse/ISPN-2825, which
should stop blocking the REBALANCE_CONFIRM commands on the
coordinator: https://github.com/danberindei/infinispan/tree/t_2825_m
I haven't issued a PR yet because I'm still getting a failure in
ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP
not receiving an ACK from itself). I'll let you know when I find out...
Yes, please do that. I saw in London that you could reproduce it in
your test, so it should be simple to find the root cause.
Weird thing: last night I tried more than 5x time in a row with
UNICAST3 and it never blocks. can this meaning a problem with
UNICAST3 or I had just lucky?
Even though the REBALANCE_CONFIRM command is sent async, the message
is still OOB. I think UNICAST/2/3 should not block any regular
message waiting for the processing of an OOB message, as long as that
message was received, so maybe the problem is in UNICAST2?
If the OOB thread added the OOB message, then it will simply pass it
up. However, the regular thread needs to wait for gaps in the receiver
table to fill, as it doesn't know what type of message will be
received (could be regular).
As soon as the OOB message has been added to the table, the regular
message will get delivered
_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
### sending the rebalance_confirm (async)
2013-02-25 19:09:16,920 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
Replication task sending CacheTopologyControlCommand{cache=___defaultcache,
type=REBALANCE_CONFIRM, sender=FullSyncWriteSkewTotalOrderTest-NodeB-36894,
joinInfo=null, topologyId=1, currentCH=null, pendingCH=null, throwable=null,
viewId=1} to single recipient FullSyncWriteSkewTotalOrderTest-NodeA-19450 with
response mode GET_NONE
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.blocks.Request] sending request (id=44)
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894
--> DATA(FullSyncWriteSkewTotalOrderTest-NodeA-19450: #7, conn_id=0)
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894:
sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450,
src=FullSyncWriteSkewTotalOrderTest-NodeB-36894, headers are RequestCorrelator:
id=200, type=REQ, id=44, rsp_expected=false, UNICAST2: DATA, seqno=7, TCP:
[channel_name=ISPN]
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.TCP] dest=127.0.0.1:8100 (137 bytes)
### sending the response (by the same thread while processing the same command)
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.blocks.RequestCorrelator] sending rsp for 33 to
FullSyncWriteSkewTotalOrderTest-NodeA-19450
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894
--> DATA(FullSyncWriteSkewTotalOrderTest-NodeA-19450: #8, conn_id=0)
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894:
sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450,
src=FullSyncWriteSkewTotalOrderTest-NodeB-36894, headers are RequestCorrelator:
id=200, type=RSP, id=33, rsp_expected=false, UNICAST2:DATA, seqno=8, TCP:
[channel_name=ISPN]
2013-02-25 19:09:16,921 TRACE
(TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.TCP] dest=127.0.0.1:8100 (82 bytes)
### in the A side, it starts to deliver the response
2013-02-25 19:09:16,922 TRACE
(Incoming-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src:
FullSyncWriteSkewTotalOrderTest-NodeB-36894 (3 headers), size=6 bytes,
flags=DONT_BUNDLE|NO_FC], headers are RequestCorrelator: id=200, type=RSP,
id=33, rsp_expected=false, UNICAST2: DATA, seqno=8, TCP: [channel_name=ISPN]
2013-02-25 19:09:16,922 TRACE
(Incoming-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450
<-- DATA(FullSyncWriteSkewTotalOrderTest-NodeB-36894: #8)
### I have no track of this response afte this point... then, the OOB message
with the new request
2013-02-25 19:09:16,922 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src:
FullSyncWriteSkewTotalOrderTest-NodeB-36894 (3 headers), size=61 bytes,
flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER|RSVP], headers are RequestCorrelator:
id=200, type=REQ, id=44, rsp_expected=false, UNICAST2: DATA, seqno=7, TCP:
[channel_name=ISPN]
2013-02-25 19:09:16,922 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450
<-- DATA(FullSyncWriteSkewTotalOrderTest-NodeB-36894: #7)
2013-02-25 19:09:16,922 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.FC] sender FullSyncWriteSkewTotalOrderTest-NodeB-36894
minus 61 credits, 1999492 remaining
2013-02-25 19:09:16,922 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.blocks.RequestCorrelator] calling
(org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher) with
request 44
2013-02-25 19:09:16,922 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.infinispan.marshall.jboss.AbstractJBossMarshaller] Start unmarshaller
after retrieving marshaller from thread local
### OOB-2 is blocked until RSP for id=33 is delivered... in the meanwhile, I
have a lot of Timers be triggered like these: (and the same for STABLE/NAKACK2)
2013-02-25 19:10:15,751 TRACE
(Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450
--> STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 2-2, conn_id=0)
2013-02-25 19:10:15,751 TRACE
(Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450,
src=FullSyncWriteSkewTotalOrderTest-NodeA-19450, headers are UNICAST2: STABLE,
seqno=2, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE
(Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
looping back message [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src:
FullSyncWriteSkewTotalOrderTest-NodeA-19450 (2 headers), size=0 bytes,
flags=OOB]
2013-02-25 19:10:15,751 TRACE
(Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450
--> STABLE(FullSyncWriteSkewTotalOrderTest-NodeB-36894: 6-8, conn_id=0)
2013-02-25 19:10:15,751 TRACE
(Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
sending msg to FullSyncWriteSkewTotalOrderTest-NodeB-36894,
src=FullSyncWriteSkewTotalOrderTest-NodeA-19450, headers are UNICAST2: STABLE,
seqno=6, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE
(OOB-8,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450:
received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src:
FullSyncWriteSkewTotalOrderTest-NodeA-19450 (2 headers), size=0 bytes,
flags=OOB], headers are UNICAST2: STABLE, seqno=2, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE
(OOB-8,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450
<-- STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 2-2, conn_id=0)
2013-02-25 19:10:15,782 TRACE
(OOB-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894:
received message batch of 1 messages from
FullSyncWriteSkewTotalOrderTest-NodeA-19450
2013-02-25 19:10:15,782 TRACE
(OOB-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeB-36894)
[org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894
<-- STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 6-8, conn_id=0)
### timeout occurs! these are the responses received for the request:
2013-02-25 19:13:16,758 TRACE
(asyncTransportThread-0,FullSyncWriteSkewTotalOrderTest-NodeA)
[org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher]
Responses:
[sender=FullSyncWriteSkewTotalOrderTest-NodeB-36894, received=false,
suspected=false]
[sender=FullSyncWriteSkewTotalOrderTest-NodeA-19450,
retval=SuccessfulResponse{responseValue=null} , received=true, suspected=false]
### OOB-2 continues and I have no track in the missing response...
2013-02-25 19:13:16,759 TRACE
(OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450)
[org.infinispan.topology.RebalanceConfirmationCollector] Rebalance confirmation
collector 1@___de
faultcache received confirmation for
FullSyncWriteSkewTotalOrderTest-NodeB-36894, remaining list is
[FullSyncWriteSkewTotalOrderTest-NodeA-19450]
_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev