[jira] Commented: (AMQ-1925) JDBC-Master/Slave Failover - Consumer stop after 1000 Messages

Mario Siegenthaler (JIRA) Wed, 10 Sep 2008 02:33:15 -0700

    [ 
https://issues.apache.org/activemq/browse/AMQ-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=45580#action_45580
 ]


Mario Siegenthaler commented on AMQ-1925:
-----------------------------------------

I've looked into it again. My first guess wasn't completely off the mark: although 
the client clears its buffers (Session#clearMessagesInProgress), it still has a 
message in progress, namely the one belonging to the current transaction. So when 
this message gets ack'd during recovery (the transaction is committed there), 
PrefetchSubscription#acknowledge gets confused.
Its inAckRange algorithm runs wild because ack.getFirstMessageId==null and the 
lastMessageId isn't in the dispatch list (with the "old" master it would have 
been the first entry). So it runs through the complete dispatch list and acks 
everything. This is a very bad move because it permanently deletes every 
message in the dispatch list (around 100 messages), including from the 
DB!
It even shows the message "Could not correlate acknowledgment with dispatched 
message", but only at debug level.
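
To make the failure mode concrete, here is a minimal sketch of a range ack that never finds its end marker. It is a simplified model, not the broker's actual code: the class AckRangeSketch, the method ackRange and the message ids are all made up for illustration. It only assumes what is described above: the first id is null, and the last id is not in the new master's dispatch list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of a range ack over a dispatched list.
// This is NOT the PrefetchSubscription source, only an illustration.
class AckRangeSketch {

    /**
     * Walks the dispatched list, starts acking at firstId (or right away when
     * firstId is null) and stops once lastId has been acked.
     * Returns the ids that were removed from the dispatched list.
     */
    static List<String> ackRange(List<String> dispatched, String firstId, String lastId) {
        List<String> acked = new ArrayList<>();
        boolean inAckRange = false;
        for (String id : dispatched) {
            if (firstId == null || firstId.equals(id)) {
                inAckRange = true;
            }
            if (inAckRange) {
                acked.add(id);
                if (id.equals(lastId)) {
                    break; // normal case: the range is closed here
                }
            }
        }
        // If lastId was never found, the loop has run over the whole list.
        dispatched.removeAll(acked);
        return acked;
    }

    public static void main(String[] args) {
        // Messages dispatched by the new master after the failover.
        List<String> dispatched = new ArrayList<>(Arrays.asList("m101", "m102", "m103"));
        // Replayed transaction ack: "m042" was only ever dispatched by the old master.
        List<String> acked = ackRange(dispatched, null, "m042");
        System.out.println("acked=" + acked + " remaining=" + dispatched);
    }
}
```

In this model the replayed ack removes all three unrelated messages and leaves the dispatched list empty, which matches the behaviour described above.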

Of course the dispatching gets completely out of sync afterwards, since all the 
messages in the broker's dispatched-list are now gone... so every single ack 
afterward clears (aka deletes from the database) the whole dispatched-list 
(mostly around 100 messages). After some ack's

Note that I only see this happening with transactional consumers; only then do I 
get a transaction-ack-replay that then goes wrong. Also note that it does not 
happen in the current trunk or the 5.1 release because they fail on AMQ-1710 
("transaction X has not been started").


> JDBC-Master/Slave Failover - Consumer stop after 1000 Messages
> --------------------------------------------------------------
>
>                 Key: AMQ-1925
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1925
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 5.1.0
>            Reporter: Mario Siegenthaler
>         Attachments: heapdump-1220373534484.hprof, 
> threaddump-1220371256910.tdump
>
>
> In a JDBC-Master/Slave Environment with ActiveMQ 5.1.0 (+patches for 1710 and 
> 1838) the failover for consumers works: the consumers resume receiving messages 
> after the failover, but then they suddenly stop after approx. 1000 messages 
> (mostly 1000, one got to 1080). The consumers are using transacted sessions.
> The thread dump looks unsuspicious; everybody is waiting on the socket:
>    java.lang.Thread.State: RUNNABLE
>       at java.net.SocketInputStream.socketRead0(Native Method)
>       at java.net.SocketInputStream.read(SocketInputStream.java:129)
>       at org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:50)
>       at org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:58)
>       at java.io.DataInputStream.readInt(DataInputStream.java:370)
>       at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:269)
>       at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:203)
>       at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:195)
>       at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:183)
>       at java.lang.Thread.run(Thread.java:619)
> A memory dump from the consumers shows that they've really run out of 
> messages and are waiting for the broker to deliver new ones. I've attached 
> both the thread dump and the heap dump to this issue (or better: I'll do so :)
> The broker doesn't do anything (also waits on the transport-socket), the 
> queue has a full page-in buffer (100 messages) but obviously fails to do 
> anything with it. If I manually trigger a doDispatch of all pagedIn messages 
> (via the debugger, just a try to revive the thing) it returns doing nothing 
> at all, since all subscriptions are full (s.isFull). I further investigated 
> the issue and was confused to see the prefetchExtension field of the 
> PrefetchSubscription having a value of -1000 (negative!). This explains why 
> it was considered full:
>   dispatched.size() - prefetchExtension >= info.getPrefetchSize()
>   0 - (-1000) >= 1000
> quite nasty.. so even though the dispatched size was zero the client didn't 
> receive any new messages.
> The only place this value can become negative is inside acknowledge, where 
> it's decremented (prefetchExtension--); all other places do a Math.max(0, X).
> So here's my guess at what happened: the client had a full (1000 messages) 
> prefetch buffer when I killed my master. As soon as the slave was done 
> starting, the clients reconnected and started processing the messages in the 
> prefetch buffer and acknowledging them. This gradually decremented the counter 
> to a negative value because the slave never got a chance to increment the 
> prefetchExtension, since it didn't actually deliver those messages.
> Possible solutions:
> - clear the prefetch buffer on a failover
> - just don't allow this value to become smaller than zero (not sure if that 
> covers all bases)
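
The fullness arithmetic from the issue description, together with the second proposed solution, can be sketched as follows. This is a hedged illustration rather than the broker's implementation: the class PrefetchFullSketch and the helpers isFull and decrementClamped are hypothetical names, and only the comparison itself and the Math.max(0, X) idiom are taken from the text above.

```java
// Hypothetical model of the "is this subscription full?" check and of a
// clamped decrement. Not the actual PrefetchSubscription source.
class PrefetchFullSketch {

    /** Fullness check as quoted in the issue description. */
    static boolean isFull(int dispatchedSize, int prefetchExtension, int prefetchSize) {
        return dispatchedSize - prefetchExtension >= prefetchSize;
    }

    /** Assumed guard: decrement the extension but never let it drop below zero. */
    static int decrementClamped(int prefetchExtension) {
        return Math.max(0, prefetchExtension - 1);
    }

    public static void main(String[] args) {
        int prefetchSize = 1000;
        int unguarded = 0;
        int guarded = 0;
        // The slave receives 1000 acks for messages it never dispatched itself.
        for (int i = 0; i < 1000; i++) {
            unguarded--;                         // plain prefetchExtension--
            guarded = decrementClamped(guarded); // clamped variant
        }
        System.out.println(unguarded + " full=" + isFull(0, unguarded, prefetchSize));
        System.out.println(guarded + " full=" + isFull(0, guarded, prefetchSize));
    }
}
```

With the plain decrement the counter ends at -1000 and the empty subscription is reported as full; with the clamped decrement it stays at 0 and the subscription is not full. As noted in the list above, whether clamping alone covers every case is an open question.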

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
