Failover causes duplicate messages
----------------------------------
Key: AMQ-2627
URL: https://issues.apache.org/activemq/browse/AMQ-2627
Project: ActiveMQ
Issue Type: Bug
Components: Broker
Affects Versions: 5.3.0
Environment: Server: 2 RHEL 5.3 x86-64 machines, kernel version 2.6.18-128.0.0.0.2.el5.
Client: same as above. Also tested with the same results on Fedora Core 11.
Reporter: Josh Carlson
Priority: Blocker
When using a shared file system master/slave ActiveMQ configuration and client acknowledgements, we run into a problem when our clients fail over to a new server. The problem is that the new server does not appear to have any knowledge of the pending messages that the old server had dispatched to clients. Consequently, all of these pending messages get dispatched a second time even though the clients had already acknowledged them.

Please confirm my suspicion that this is a server-side bug, and let me know if there are any suggestions for working around this issue. I have set the priority to 'Blocker' because it blocks our progress towards deploying an ActiveMQ solution in our infrastructure.

If you look at the log file from the new broker you can see that the acks for those messages do not get matched:
2010-02-24 12:46:49,759 | WARN | Async error occurred:
javax.jms.JMSException: Unmatched acknowledege:
I do not know whether this gets bubbled up to the client or not. If it does, it must be handled under the hood in activemq-cpp, because from the application layer I do not see any errors. In our in-house Perl STOMP client we wind up getting an ERROR frame which it did not know what to do with; this is where I initially ran into the problem. Today is my first day using CMS, which I am using to verify that the bug is independent of the client and to provide a reproducer using a client everyone should have ready access to.
The attached tar file contains the following items for reproducing this problem.

Contents:

README.txt                   - This file
activemq_1.xml               - ActiveMQ config for the server that was master at the time I started the consumer
activemq_2.xml               - ActiveMQ config for the broker that became master after the original master failed
activemq_1.log               - Log file from the first server
activemq_2.log               - Log file from the second server
producers/SimpleProducer.cpp - Modified version of the program shipped in activemq-cpp-library-3.1.0; it sends only two messages and takes the two broker hosts on the command line
consumers/SimpleConsumer.cpp - New file, but really just a modified version of the SimpleAsyncConsumer shipped with activemq-cpp-library-3.1.0 (see the sketch after this list). Modified as follows:
                               - Retrieves messages synchronously and in one thread (so we can see what is going on)
                               - Takes two command line arguments naming the broker hosts to use in the broker URI
                               - Uses client acknowledgements
                               - After retrieving a message it blocks waiting on standard input (so one has time to go kill the server)
Makefile.am                  - Modified version of the makefile to build the new SimpleConsumer program
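For reference, here is a minimal sketch of what the modified SimpleConsumer does. It is not the exact attached file; the queue name TEST.FOO, the console wording and the error handling are my own assumptions. It shows the failover URI built from the two hostnames, the client-acknowledge session, and the block on stdin before each acknowledge:

#include <activemq/library/ActiveMQCPP.h>
#include <activemq/core/ActiveMQConnectionFactory.h>
#include <cms/Connection.h>
#include <cms/Session.h>
#include <cms/Destination.h>
#include <cms/MessageConsumer.h>
#include <cms/TextMessage.h>
#include <cms/CMSException.h>
#include <iostream>
#include <memory>
#include <string>

int main(int argc, char* argv[]) {

    if (argc < 3) {
        std::cerr << "usage: simple_consumer <broker-host-1> <broker-host-2>" << std::endl;
        return 1;
    }

    activemq::library::ActiveMQCPP::initializeLibrary();
    try {
        // Failover URI built from the two broker hosts given on the command line.
        std::string brokerURI =
            "failover:(tcp://" + std::string(argv[1]) + ":61616,"
            "tcp://" + std::string(argv[2]) + ":61616)";

        activemq::core::ActiveMQConnectionFactory factory(brokerURI);
        std::auto_ptr<cms::Connection> connection(factory.createConnection());
        connection->start();

        // Client acknowledgement mode: the application acks each message itself.
        std::auto_ptr<cms::Session> session(
            connection->createSession(cms::Session::CLIENT_ACKNOWLEDGE));
        std::auto_ptr<cms::Destination> destination(session->createQueue("TEST.FOO"));
        std::auto_ptr<cms::MessageConsumer> consumer(
            session->createConsumer(destination.get()));

        int count = 0;
        while (true) {
            // Synchronous receive in a single thread, so the sequence of events is visible.
            std::auto_ptr<cms::Message> message(consumer->receive());
            const cms::TextMessage* text =
                dynamic_cast<const cms::TextMessage*>(message.get());
            std::cout << "Message #" << ++count << " Received: "
                      << (text != NULL ? text->getText() : "<non-text message>")
                      << std::endl;

            // Block on stdin so there is time to go kill the master broker
            // before this message is acknowledged.
            std::cout << "Waiting for stdin to acknowledge" << std::endl;
            std::string line;
            std::getline(std::cin, line);

            message->acknowledge();
        }
    } catch (cms::CMSException& e) {
        e.printStackTrace();
    }
    activemq::library::ActiveMQCPP::shutdownLibrary();
    return 0;
}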
Note that these files must be built from inside an activemq-cpp build tree. So the first step to reproduce this problem is to copy producers/SimpleProducer.cpp, consumers/SimpleConsumer.cpp and Makefile.am into your src/examples directory, then run a top-level configure and make. I ran this using activemq-cpp-library version 3.1.0.
This reproducer expects that you have exactly two ActiveMQ brokers and that they are configured as a shared file system master/slave pair. It also expects an OpenWire transport connector listening on port 61616 on those two machines. (Note: you'll see my ActiveMQ configs use the transport URI uri="tcp://q1masterhost:61616"; q1masterhost resolves to the ethernet 0 interface on each of the hosts.)
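In other words, with those two brokers the clients in this reproducer end up connecting with a failover URI along these lines (the hostnames are the ones from the transcript below; the exact string the modified examples build is my assumption):

// Sketch only: the client-side broker URI over the two brokers' OpenWire connectors.
#include <string>

const std::string brokerURI =
    "failover:(tcp://mmq1:61616,tcp://mmq2:61616)";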
Once you have those two brokers set up and running, go ahead and run the simple_producer code, passing the hostnames of your two brokers on the command line:
[jcarl...@rocky examples]$ ./simple_producer mmq1 mmq2
=====================================================
Starting the example:
-----------------------------------------------------
Sent message #1 from thread 139817389041504
Sent message #2 from thread 139817389041504
-----------------------------------------------------
Finished with the example.
=====================================================
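The producer-side change is small; a minimal sketch of the send loop is below. Connection setup (the failover URI built from the two hostnames) is the same as in the consumer sketch above, and the queue name, message text and persistent delivery mode are my assumptions:

#include <cms/Connection.h>
#include <cms/Session.h>
#include <cms/Destination.h>
#include <cms/MessageProducer.h>
#include <cms/TextMessage.h>
#include <cms/DeliveryMode.h>
#include <iostream>
#include <memory>

// Sketch of the modified SimpleProducer's send loop: two text messages,
// sent persistent so they survive the master broker being killed.
void sendTwoMessages(cms::Connection* connection) {
    std::auto_ptr<cms::Session> session(
        connection->createSession(cms::Session::AUTO_ACKNOWLEDGE));
    std::auto_ptr<cms::Destination> destination(session->createQueue("TEST.FOO"));
    std::auto_ptr<cms::MessageProducer> producer(
        session->createProducer(destination.get()));
    producer->setDeliveryMode(cms::DeliveryMode::PERSISTENT);

    for (int i = 1; i <= 2; ++i) {
        std::auto_ptr<cms::TextMessage> message(
            session->createTextMessage("Hello world!"));
        producer->send(message.get());
        std::cout << "Sent message #" << i << std::endl;
    }
}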
Now do the same for the simple_consumer:
[jcarl...@rocky examples]$ ./simple_consumer mmq1 mmq2
=====================================================
Starting the example:
-----------------------------------------------------
Message #1 Received: Hello world! from thread 139817389041504
Waiting for stdin to acknoledge
The app has retrieved one message but has not ack'ed it yet. Now go identify
which host has the master broker and kill the process. The master broker will
be the one which is *not* printing 'Database [lockfile] is locked' messages.
In my case the broker was on mmq1 so I did this in another terminal:
ssh -t mmq1 sudo pkill java
Immediately I see this in the console I started the consumer in:
The Connection's Transport has been Interrupted.
and then a few seconds later I see:
The Connection's Transport has been Restored.
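Those two lines presumably come from transport listener callbacks like the ones in the stock SimpleAsyncConsumer, which my modified consumer keeps. A sketch of the relevant part, assuming the registration follows the stock example:

#include <activemq/core/ActiveMQConnection.h>
#include <activemq/transport/DefaultTransportListener.h>
#include <iostream>

// Callbacks that fire when the failover transport loses the master broker
// and later reconnects to the new master.
class ConsumerTransportListener : public activemq::transport::DefaultTransportListener {
public:
    virtual void transportInterrupted() {
        std::cout << "The Connection's Transport has been Interrupted." << std::endl;
    }

    virtual void transportResumed() {
        std::cout << "The Connection's Transport has been Restored." << std::endl;
    }
};

// Registration, as in the stock example (connection is the cms::Connection
// created by the consumer, listener an instance of the class above):
//
//   activemq::core::ActiveMQConnection* amqConnection =
//       dynamic_cast<activemq::core::ActiveMQConnection*>(connection.get());
//   if (amqConnection != NULL) {
//       amqConnection->addTransportListener(&listener);
//   }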
At this point I hit enter in the terminal so that the message I received on the other broker gets acknowledged and the consumer tries to get another message:
Message #2 Received: Hello world! from thread 139817389041504
Waiting for stdin to acknoledge
OK, at this point, since I have only put two messages on the queue, I don't expect any more; so when I hit enter and go back to get another message, I expect it to just sit and wait for a new message to come in. This is not what happens. A third message is retrieved:
Message #3 Received: Hello world! from thread 139817389041504
Waiting for stdin to acknoledge
At this point, when I hit enter again, the app blocks and I kill it with Ctrl-C.