[jira] Commented: (AMQ-2149) Shared Filesystem Master Slave: missing messages

Aaron Riekenberg (JIRA) Thu, 12 Mar 2009 16:17:08 -0700

    [ 
https://issues.apache.org/activemq/browse/AMQ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=50487#action_50487
 ]


Aaron Riekenberg commented on AMQ-2149:
---------------------------------------

Dejan -

Based on your comments, I ran two tests, both against FUSE Message Broker 
5.3.0.0.  I did not set the prefetch size, so it kept its default value.  I 
did comment out the entire <systemUsage> stanza of the broker's configuration.

1. I ran with the script's default 20-second sleep between steps, as in my 
original test, to isolate the effect of removing <systemUsage>.  This failed 
after 3 broker failovers.  The activemq log for this run is attached as 
activemq.log.2009_03_12_1

{{Thu Mar 12 17:59:15 CDT 2009 started master broker pid 28845}}
{{Thu Mar 12 17:59:25 CDT 2009 started slave broker pid 29069}}
{{Thu Mar 12 17:59:35 CDT 2009 killing master broker pid 28845, new master pid 29069}}
{{Thu Mar 12 17:59:55 CDT 2009 started slave broker pid 29285}}
{{Thu Mar 12 18:00:15 CDT 2009 killing master broker pid 29069, new master pid 29285}}
{{Thu Mar 12 18:00:35 CDT 2009 started slave broker pid 29515}}
{{Thu Mar 12 18:00:55 CDT 2009 killing master broker pid 29285, new master pid 29515}}

{{Mar 12, 2009 6:00:56 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect}}
{{INFO: Successfully reconnected to tcp://localhost:61616}}
{{Mar 12, 2009 6:00:56 PM org.aaron.MasterSlaveTest$Receiver onMessage}}
{{WARNING: test.queue.8 received 520 expected 2712}}


2. Then I modified the script so it sleeps 60 seconds between steps instead of 
20.  This also failed after 3 broker failovers.  The activemq log for this run 
is attached as activemq.log.2009_03_12_2

{{Thu Mar 12 18:03:34 CDT 2009 started master broker pid 29871}}
{{Thu Mar 12 18:03:44 CDT 2009 started slave broker pid 30090}}
{{Thu Mar 12 18:04:44 CDT 2009 killing master broker pid 29871, new master pid 30090}}
{{Thu Mar 12 18:05:44 CDT 2009 started slave broker pid 30402}}
{{Thu Mar 12 18:06:44 CDT 2009 killing master broker pid 30090, new master pid 30402}}
{{Thu Mar 12 18:07:44 CDT 2009 started slave broker pid 30725}}
{{Thu Mar 12 18:08:44 CDT 2009 killing master broker pid 30402, new master pid 30725}}

{{Mar 12, 2009 6:08:46 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect}}
{{INFO: Successfully reconnected to tcp://localhost:61616}}
{{Mar 12, 2009 6:08:46 PM org.aaron.MasterSlaveTest$Receiver onMessage}}
{{WARNING: test.queue.5 received 1049 expected 3205}}


> Shared Filesystem Master Slave: missing messages
> ------------------------------------------------
>
>                 Key: AMQ-2149
>                 URL: https://issues.apache.org/activemq/browse/AMQ-2149
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.2.0
>         Environment: Ubuntu Linux 8.10 AMD64, Sun JDK 1.6.0.10
>            Reporter: Aaron Riekenberg
>         Attachments: activemq.log, activemq.log.2009_03_12_1, 
> activemq.log.2009_03_12_2, activemq.xml, MasterSlaveTest.java, 
> MasterSlaveTestWithTransactions.java, run_master_slave_brokers.sh
>
>
> I'm finding that messages are occasionally not delivered in order in a shared 
> filesystem master slave setup when the master fails and the slave takes over. 
>  I'm running a simple test on one physical machine where the shared 
> filesystem is on a single disk (no SAN currently involved).
> I'm attaching a shell script (run_master_slave_brokers.sh) that starts a 
> master and slave broker in the same directory, sleeps 20 seconds, kills the 
> master, sleeps 20 seconds, starts a new slave, sleeps 20 seconds, kills the 
> master, etc.
> Also attached is a small Java test program (MasterSlaveTest.java).  The 
> program starts 10 JMS senders that send 75kb text messages every 25 ms to 
> unique queues.  These messages contain a sequence number header (a long).  
> The program also starts 10 receivers (1 for each queue) that keep track of 
> the next expected sequence number and validate each incoming sequence number. 
>  If a receiver gets an unexpected sequence number, the test program exits 
> (System.exit(1)).  Both the senders and receivers use the failover transport 
> to connect to the broker.  Messages being sent are persistent, so in theory 
> there should be no message loss when the master fails and slave takes over.
> I run the script to start the brokers, then run my test program.  Most times 
> when the script kills the master and the slave is promoted, things work fine 
> - the test program reconnects, and messages continue to be delivered in 
> order.  If I run this long enough though, eventually my test program fails 
> just after a slave broker is promoted to master with output similar to this:
> Mar 6, 2009 11:58:12 AM 
> org.apache.activemq.transport.failover.FailoverTransport doReconnect
> INFO: Successfully reconnected to tcp://localhost:61616
> Mar 6, 2009 11:58:12 AM org.aaron.MasterSlaveTest$Receiver onMessage
> WARNING: test.queue.3 received 630 expected 629
> This indicates the receiver for test.queue.3 received message 630 after the 
> slave broker took over and missed message 629.
> This seems to happen more often when more senders and receivers are running 
> and more queues are in use.  If I run a single sender/receiver pair on 1 
> queue, it is very difficult to make this happen.
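
For reference, the receiver-side check described in the issue text above can 
be sketched roughly as follows.  This is only a minimal illustration and not 
the attached MasterSlaveTest.java: the class name SequenceChecker and its 
check method are hypothetical, the JMS plumbing (reading the sequence number 
header in onMessage) is left out, and where the real test calls 
System.exit(1) this sketch just returns the warning text.

```java
import java.util.Optional;

/**
 * Hypothetical sketch of a per-queue sequence check.
 * Assumption: one instance per queue, fed the sequence number of each
 * incoming message in arrival order.
 */
public class SequenceChecker {
    private final String queueName;
    private long expected;

    public SequenceChecker(String queueName, long firstExpected) {
        this.queueName = queueName;
        this.expected = firstExpected;
    }

    /**
     * Returns a warning string if the sequence number is not the one
     * expected next, or an empty Optional if it is in order.
     */
    public Optional<String> check(long received) {
        if (received != expected) {
            return Optional.of(
                queueName + " received " + received + " expected " + expected);
        }
        expected++; // in order: advance to the next expected number
        return Optional.empty();
    }

    public static void main(String[] args) {
        SequenceChecker checker = new SequenceChecker("test.queue.3", 628);
        System.out.println(checker.check(628).isPresent()); // false
        // A gap: 629 never arrived, so 630 is reported as out of order.
        System.out.println(checker.check(630).orElse("in order"));
    }
}
```

The same check flags both cases seen so far: a received number higher than 
expected (as in the original report) and a received number lower than 
expected (as in the two runs in this comment).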

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
