Failover causes duplicate messages
----------------------------------

                 Key: AMQ-2627
                 URL: https://issues.apache.org/activemq/browse/AMQ-2627
             Project: ActiveMQ
          Issue Type: Bug
          Components: Broker
    Affects Versions: 5.3.0
         Environment: Server: 2 RHEL 5.3 x86-64 machines. Kernel version 2.6.18-128.0.0.0.2.el5.
Client: Same as above. Also tested with same results on Fedora Core 11
            Reporter: Josh Carlson
            Priority: Blocker


When using a shared file system master/slave ActiveMQ configuration with client
acknowledgements, we run into a problem when our clients fail over to a new
server. The new server does not appear to have any knowledge of pending
messages that the old server had dispatched to clients. Consequently, all of
these pending messages get dispatched a second time, even though the clients
had acknowledged them.

Please confirm my suspicion that this is a server-side bug, and let me know if
there are any suggestions for working around this issue. I have set the
priority to 'Blocker' because it blocks our progress towards deploying an
ActiveMQ solution to our infrastructure.
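
One possible client-side workaround, until the broker behaviour is confirmed,
is to drop messages whose ID has already been processed. The sketch below is
not part of the attached reproducer and makes an assumption the report does
not confirm: that the second dispatch carries the same message ID as the first
(in CMS this would presumably be the value of cms::Message::getCMSMessageID(),
in Stomp the message-id header). The class name DuplicateFilter and its
capacity parameter are hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Remembers the most recent `capacity` message IDs and reports repeats.
// Hypothetical helper; assumes redelivered messages keep their original ID.
class DuplicateFilter {
public:
    explicit DuplicateFilter(std::size_t capacity) : capacity_(capacity) {}

    // Returns true the first time an ID is seen, false for a remembered repeat.
    bool firstSighting(const std::string& messageId) {
        if (seen_.count(messageId) > 0) {
            return false;
        }
        if (capacity_ == 0) {
            return true;  // nothing is remembered, so nothing can be filtered
        }
        if (order_.size() == capacity_) {
            // Forget the oldest ID to keep memory bounded.
            seen_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(messageId);
        seen_.insert(messageId);
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<std::string> order_;
    std::unordered_set<std::string> seen_;
};
```

A consumer would call firstSighting() with each received message's ID and
skip processing when it returns false. The duplicate would most likely still
need to be acknowledged so the broker stops redelivering it. This only hides
the symptom on the client and does not address the broker-side question.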

If you look at the log file from the new broker, you can see that the acks for
those messages do not get matched:

   2010-02-24 12:46:49,759 | WARN  | Async error occurred: javax.jms.JMSException: Unmatched acknowledege:

I do not know whether this gets bubbled up to the client. If it does, it must
be handled under the hood in activemq-cpp, because I do not see any errors at
the application layer. Our in-house Perl Stomp client wound up getting an ERROR
frame that it did not know how to handle, which is where I initially ran into
this problem. Today is my first day using CMS; I am using it to verify whether
the bug is independent of the client and to provide a reproducer using a client
everyone should have ready access to.

The attached tar file contains the following files for reproducing this
problem.

Contents:

   README.txt                   - This file
   activemq_1.xml               - ActiveMQ config for the server that was master
                                  at the time I started the consumer
   activemq_2.xml               - ActiveMQ config for the broker which became
                                  the master after the original master failed
   activemq_1.log               - Log file from the first server
   activemq_2.log               - Log file from the second server
   producers/SimpleProducer.cpp - Modified version of the program shipped in
                                  activemq-cpp-library-3.1.0 to send only 2
                                  messages and accept two broker hosts on the
                                  command line.
   consumers/SimpleConsumer.cpp - New file, but really just a modified version
                                  of SimpleAsyncConsumer shipped with
                                  activemq-cpp-library-3.1.0. Modified as
                                  follows:
                                     - Retrieves messages synchronously and in
                                       one thread (so we can see what is going
                                       on)
                                     - Takes two command line options naming
                                       the broker hosts to use in the broker URI
                                     - Uses client acknowledgements
                                     - After retrieving a message, it blocks
                                       waiting for standard input (so one has
                                       time to go kill the server)
   Makefile.am                  - Modified version of the makefile to build the
                                  new SimpleConsumer program.
    
    
Note that these files must be built from inside an activemq-cpp build tree. So
the first step to reproduce this problem is to copy
producers/SimpleProducer.cpp, consumers/SimpleConsumer.cpp, and Makefile.am to
your src/examples directory, then run a top-level configure and make. I ran
this using activemq-cpp-library version 3.1.0.
    
This reproducer expects that you have only 2 ActiveMQ brokers and that they
are set up in a shared file system master/slave configuration. It also expects
an openwire transport connector listening on port 61616 on both machines.
(Note: my activemq configs use the transport URI
uri="tcp://q1masterhost:61616"; q1masterhost resolves to the ethernet 0
interface on each of the hosts.)
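
For reference, the two hostnames given on the command line end up in a single
broker URI. The attached sources are not reproduced here, so the following is
only a sketch of how such a URI could be assembled, assuming the standard
ActiveMQ failover transport syntax and the port 61616 mentioned above. The
function name buildFailoverUri is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds a failover broker URI of the form
//   failover:(tcp://host1:61616,tcp://host2:61616)
// Hypothetical helper; the reproducer's actual URI construction may differ.
std::string buildFailoverUri(const std::vector<std::string>& hosts,
                             int port = 61616) {
    std::string uri = "failover:(";
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        if (i > 0) {
            uri += ",";
        }
        uri += "tcp://" + hosts[i] + ":" + std::to_string(port);
    }
    uri += ")";
    return uri;
}
```

With the hosts used in this report, buildFailoverUri({"mmq1", "mmq2"}) yields
failover:(tcp://mmq1:61616,tcp://mmq2:61616).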

Once you have those two brokers set up and running, run the simple_producer
code, passing the hostnames of your two brokers on the command line:

        [jcarl...@rocky examples]$ ./simple_producer mmq1 mmq2
        =====================================================
        Starting the example:
        -----------------------------------------------------
        Sent message #1 from thread 139817389041504
        Sent message #2 from thread 139817389041504
        -----------------------------------------------------
        Finished with the example.
        =====================================================

Now do the same for the simple_consumer:

        [jcarl...@rocky examples]$ ./simple_consumer mmq1 mmq2
        =====================================================
        Starting the example:
        -----------------------------------------------------
        Message #1 Received: Hello world! from thread 139817389041504
        Waiting for stdin to acknoledge

The app has retrieved one message but has not ack'ed it yet. Now go identify
which host has the master broker and kill the process. The master broker will
be the one which is *not* printing 'Database [lockfile] is locked' messages.
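
If you would rather check this programmatically than by eye, the test amounts
to looking at the most recent lines of each broker's log. The sketch below
assumes only what is stated above, namely that a waiting slave keeps printing
lines containing "Database" and "is locked". The function name
tailShowsLockWait and the lastLines parameter are hypothetical, and the exact
log wording may vary between ActiveMQ versions.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns true if any of the last `lastLines` lines of `logText` looks like a
// "Database ... is locked" message, i.e. the broker is still waiting as slave.
// Only the tail is inspected because a current master may have logged such
// lines earlier, before it acquired the lock.
bool tailShowsLockWait(const std::string& logText, std::size_t lastLines) {
    std::vector<std::string> lines;
    std::istringstream in(logText);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    const std::size_t start =
        lines.size() > lastLines ? lines.size() - lastLines : 0;
    for (std::size_t i = start; i < lines.size(); ++i) {
        if (lines[i].find("Database") != std::string::npos &&
            lines[i].find("is locked") != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

The broker whose log tail makes this return false is the likely master.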

In my case the broker was on mmq1 so I did this in another terminal:

        ssh -t mmq1 sudo pkill java

Immediately I see this in the console where I started the consumer:

  The Connection's Transport has been Interrupted.

and then a few seconds later I see:

  The Connection's Transport has been Restored.

At this point I hit enter in the terminal so that the message I received from
the other broker gets acknowledged and the consumer tries to get another
message:

  Message #2 Received: Hello world! from thread 139817389041504
  Waiting for stdin to acknoledge

At this point, since I have only put two messages on the queue, I don't expect
any more. So when I hit enter and the consumer goes back to get another
message, I expect it to just sit and wait for another message to come in. This
is not what happens. A third message is retrieved:

  Message #3 Received: Hello world! from thread 139817389041504
  Waiting for stdin to acknoledge

At this point, when I hit enter again, the app blocks and I kill it with
Ctrl-C.


