Shared Filesystem Master Slave: missing messages
------------------------------------------------

                 Key: AMQ-2149
                 URL: https://issues.apache.org/activemq/browse/AMQ-2149
             Project: ActiveMQ
          Issue Type: Bug
    Affects Versions: 5.2.0
         Environment: Ubuntu Linux 8.10 AMD64, Sun JDK 1.6.0.10
            Reporter: Aaron Riekenberg


I'm finding that messages are occasionally delivered out of order in a shared 
filesystem master slave setup when the master fails and the slave takes over.  
I'm running a simple test on one physical machine where the shared filesystem 
is on a single disk (no SAN currently involved).

I'm attaching a shell script (run_master_slave_brokers.sh) that starts a master 
and slave broker in the same directory, sleeps 20 seconds, kills the master, 
sleeps 20 seconds, starts a new slave, sleeps 20 seconds, kills the master, etc.

Also attached is a small Java test program (MasterSlaveTest.java).  The program 
starts 10 JMS senders that send 75kb text messages every 25 ms to unique 
queues.  These messages contain a sequence number header (a long).  The program 
also starts 10 receivers (1 for each queue) that keep track of the next 
expected sequence number and validate each incoming sequence number.  If a 
receiver gets an unexpected sequence number, the test program exits 
(System.exit(1)).  Both the senders and receivers use the failover transport to 
connect to the broker.  Messages being sent are persistent, so in theory there 
should be no message loss when the master fails and the slave takes over.
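
For readers without the attachment, the receiver-side check described above can be sketched roughly as follows.  This is not the code from MasterSlaveTest.java; the class and method names (SequenceChecker, accept, nextExpected) are made up for illustration, and the sketch reports a mismatch to the caller instead of calling System.exit(1).

```java
// Hypothetical helper, NOT taken from MasterSlaveTest.java.
// Tracks the next expected sequence number for one queue.
final class SequenceChecker {
    private long nextExpected;

    SequenceChecker(long firstExpected) {
        this.nextExpected = firstExpected;
    }

    /**
     * Returns true and advances the counter if the received sequence
     * number is the expected one; returns false and leaves the counter
     * unchanged otherwise.
     */
    boolean accept(long received) {
        if (received != nextExpected) {
            return false;
        }
        nextExpected++;
        return true;
    }

    long nextExpected() {
        return nextExpected;
    }
}
```

In a real receiver, accept() would presumably be called from onMessage with the value of the sequence number header, one checker per queue.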

I run the script to start the brokers, then run my test program.  Most times 
when the script kills the master and the slave is promoted, things work fine - 
the test program reconnects, and messages continue to be delivered in order.  
If I run this long enough though, eventually my test program fails just after a 
slave broker is promoted to master with output similar to this:


Mar 6, 2009 11:58:12 AM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully reconnected to tcp://localhost:61616
Mar 6, 2009 11:58:12 AM org.aaron.MasterSlaveTest$Receiver onMessage
WARNING: test.queue.3 received 630 expected 629


This indicates the receiver received message 630 after the slave broker took 
over.  This means the receiver missed message 629.
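
To make the reading of that warning explicit, here is a small sketch of how a mismatch could be classified.  It assumes each sender increments its sequence number by exactly 1 per message; the SequenceGap class is hypothetical and not part of the attached test.

```java
// Hypothetical helper for interpreting a "received X expected Y" warning.
// Assumes sequence numbers increase by exactly 1 per message.
final class SequenceGap {
    private SequenceGap() {}

    /** Number of messages skipped before the received one; 0 if none. */
    static long skippedCount(long expected, long received) {
        return Math.max(0L, received - expected);
    }

    /** Short human-readable classification of the mismatch. */
    static String describe(long expected, long received) {
        if (received == expected) {
            return "in order";
        }
        if (received > expected) {
            return "gap: " + (received - expected) + " message(s) skipped";
        }
        return "duplicate or redelivery of an earlier message";
    }
}
```

With the values from the log above (expected 629, received 630), describe() reports a gap of one skipped message.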

This seems to happen more often when more senders and receivers are running and 
more queues are in use.  If I run a single sender/receiver pair on 1 queue, it 
is very difficult to make this happen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.