[jira] Updated: (AMQ-2149) Shared Filesystem Master Slave: missing messages

Gary Tully (JIRA) Thu, 26 Mar 2009 09:30:37 -0700

     [ 
https://issues.apache.org/activemq/browse/AMQ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gary Tully updated AMQ-2149:
----------------------------

    Attachment: amq2149.patch

Here is a patch for this issue. There were a number of related problems. 
When the connection goes down during a message send, the failover transport 
replays the message, and the replayed message can then be a duplicate.

One problem with the persisted hashindex that backs the reference store could 
result in a spurious message reference and "could not be recovered, already 
dispatched message". 
In addition, if the duplicate message is received before the existing message 
is dispatched and acked, the duplicate reference remains and will eventually 
get dispatched on restart or when the duplicate checker exceeds its range.
Eliminating the duplicate reference at source in the reference store resolves 
both of these issues.

When memory limits were reached, it was possible for a store's messages to be 
exhausted (the cursor gets to the end), causing dispatch to halt until a 
restart. Dealing with recovery failure due to lack of space resolves this.

A consumer ack of a duplicate message could be replaced with a delivery ack and 
get lost, leaving the duplicate to be redispatched on restart and delivered to 
the consumer once the client-side message audit is exceeded. This explains the 
redelivery of old messages. Ensuring each duplicate is acked in turn resolves 
this.

Finally, an unmatched ack can occur if recovery dispatch has not yet happened 
after a restart and an ack arrives from a consumer for a message that was 
outstanding. In this case, the subscription can wait for recovery and a 
dispatch to kick in, so that it will have a record of the target message. 
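
To illustrate the idea of eliminating the duplicate reference at source, here 
is a minimal sketch. It is not the code in amq2149.patch and not the ActiveMQ 
reference store; the class DedupReferenceStore, its methods and the use of 
plain strings for message ids and locations are all assumptions made for the 
example. It only shows the principle: a replayed send with an id that is 
already referenced is dropped when it is added, so no second reference is left 
behind to be dispatched later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the ActiveMQ reference store
// and not the code contained in amq2149.patch.
class DedupReferenceStore {
    // message id -> location of the stored payload (a String for simplicity)
    private final Map<String, String> references = new LinkedHashMap<>();

    /**
     * Adds a reference unless one with the same id already exists.
     * Returns true if the reference was stored, false for a duplicate.
     */
    synchronized boolean addReference(String messageId, String location) {
        if (references.containsKey(messageId)) {
            // Duplicate (for example a send replayed after a reconnect):
            // drop it here so it can never be dispatched later.
            return false;
        }
        references.put(messageId, location);
        return true;
    }

    /** Removes the reference on ack; returns false if nothing matched. */
    synchronized boolean ack(String messageId) {
        return references.remove(messageId) != null;
    }

    /** Number of references still awaiting an ack. */
    synchronized int size() {
        return references.size();
    }
}
```

With this shape, a single ack is enough to clear a message that was sent twice, 
because only one reference was ever stored for it.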


> Shared Filesystem Master Slave: missing messages
> ------------------------------------------------
>
>                 Key: AMQ-2149
>                 URL: https://issues.apache.org/activemq/browse/AMQ-2149
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.2.0
>         Environment: Ubuntu Linux 8.10 AMD64, Sun JDK 1.6.0.10
>            Reporter: Aaron Riekenberg
>            Assignee: Gary Tully
>         Attachments: activemq.log, activemq.log.2009_03_12_1, 
> activemq.log.2009_03_12_2, activemq.xml, AMQ-2149.zip, amq2149.patch, 
> MasterSlaveTest.java, MasterSlaveTestWithTransactions.java, 
> run_master_slave_brokers.sh
>
>
> I'm finding occasionally messages are not delivered in order in a shared 
> filesystem master slave setup when the master fails and the slave takes over. 
>  I'm running a simple test on one physical machine where the shared 
> filesystem is on a single disk (no SAN currently involved).
> I'm attaching a shell script (run_master_slave_brokers.sh) that starts a 
> master and slave broker in the same directory, sleeps 20 seconds, kills the 
> master, sleeps 20 seconds, starts a new slave, sleeps 20 seconds, kills the 
> master, etc.
> Also attached is a small Java test program (MasterSlaveTest.java).  The 
> program starts 10 JMS senders that send 75kb text messages every 25 ms to 
> unique queues.  These messages contain a sequence number header (a long).  
> The program also starts 10 receivers (1 for each queue) that keep track of 
> the next expected sequence number and validate each incoming sequence number. 
>  If a receiver gets an unexpected sequence number, the test program exits 
> (System.exit(1)).  Both the senders and receivers use the failover transport 
> to connect to the broker.  Messages being sent are persistent, so in theory 
> there should be no message loss when the master fails and slave takes over.
> I run the script to start the brokers, then run my test program.  Most times 
> when the script kills the master and the slave is promoted, things work fine 
> - the test program reconnects, and messages continue to be delivered in 
> order.  If I run this long enough though, eventually my test program fails 
> just after a slave broker is promoted to master with output similar to this:
> Mar 6, 2009 11:58:12 AM org.apache.activemq.transport.failover.FailoverTransport doReconnect
> INFO: Successfully reconnected to tcp://localhost:61616
> Mar 6, 2009 11:58:12 AM org.aaron.MasterSlaveTest$Receiver onMessage
> WARNING: test.queue.3 received 630 expected 629
> This indicates the receiver for test.queue.3 received message 630 after the 
> slave broker took over and missed message 629.
> This seems to happen more often when more senders and receivers are running 
> and more queues are in use.  If I run a single sender/receiver pair on 1 
> queue, it is very difficult to make this happen.
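
The receiver-side check described in the report can be sketched roughly as 
follows. This is not the attached MasterSlaveTest.java; the class name 
SequenceChecker and its methods are assumptions for illustration, and the JMS 
plumbing (reading the long sequence header from each message) is left out so 
the logic stands on its own. Unlike the original test, which calls 
System.exit(1), this sketch simply reports the mismatch to the caller.

```java
// Hypothetical illustration only: not the attached MasterSlaveTest.java.
class SequenceChecker {
    // Next sequence number this receiver expects to see on its queue.
    private long expected;

    SequenceChecker(long firstExpected) {
        this.expected = firstExpected;
    }

    /**
     * Validates one incoming sequence number.
     * Returns true and advances when it matches the expected value;
     * returns false and leaves the expectation unchanged otherwise.
     */
    boolean accept(long received) {
        if (received != expected) {
            return false; // gap, duplicate or out-of-order delivery
        }
        expected++;
        return true;
    }

    long expected() {
        return expected;
    }
}
```

In the failure quoted above, a checker expecting 629 would reject 630, which 
corresponds to the "received 630 expected 629" warning.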

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
