Jeremy Lucier created AMQ-4954:
----------------------------------
Summary: PooledConnectionFactory Race Condition & Failover Problems
Key: AMQ-4954
URL: https://issues.apache.org/jira/browse/AMQ-4954
Project: ActiveMQ
Issue Type: Bug
Components: JMS client
Affects Versions: 5.9.0
Environment: Linux 64bit, 8-core, 16gb RAM.
Reporter: Jeremy Lucier
[Sorry for this long bug report]
We're currently using ActiveMQ 5.9 via a Camel implementation for
consuming/polling messages off of queues as well as sending messages. With the
latest version of ActiveMQ on both the client and server side (5.9), we've
discovered numerous problems with the changes to PooledConnectionFactory.
These problems may have existed before, but it wasn't until recently that we
were able to do any basic load testing on our system.
[1] The first one is a race condition, which is actually documented within the
code itself via a comment (the createConnection method in
PooledConnectionFactory). If we start our Camel-based application while the
broker is down, and then bring the broker up, it takes an indeterminate
amount of time to reconnect (typically a few hours of thrashing). The reason
is that on startup or failover, all the routes/threads attempt to get a
connection and establish a new session via PooledConnectionFactory in unison,
over and over again. That's fine and expected behavior, but it leads us to
the next problem, which contributes to the race condition.
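To illustrate the kind of load pattern I mean, here's a rough sketch (not our
actual Camel setup; the broker URL, thread count, and retry loop are just
placeholders) of many consumers borrowing from one PooledConnectionFactory
while the broker is still down:
{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;

public class StartupThrashSketch {

    public static void main(String[] args) {
        // Broker is assumed to be down when this starts and brought up later.
        // startupMaxReconnectAttempts makes the failover transport surface
        // failures instead of blocking forever inside start().
        ActiveMQConnectionFactory amq = new ActiveMQConnectionFactory(
                "failover:(tcp://localhost:61616)?startupMaxReconnectAttempts=3");

        final PooledConnectionFactory pool = new PooledConnectionFactory(amq);
        pool.setMaxConnections(1); // one pooled connection makes the race easy to hit

        // Stand-in for many routes/threads all trying to borrow a connection
        // at the same time, retrying immediately on failure.
        for (int i = 0; i < 20; i++) {
            new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try {
                            Connection c = pool.createConnection();
                            c.start();
                            return; // this "route" finally has a usable connection
                        } catch (JMSException e) {
                            // every thread retries in unison, hammering
                            // createConnection() over and over
                        }
                    }
                }
            }).start();
        }
    }
}
{code}
In our real system the threads doing this are the Camel routes, which is why
the thrashing can go on for hours.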
[2] The PooledConnectionFactory continues to return expired/expiring
ConnectionPool instances, and the calling threads fight to increment/decrement
the internal reference counter within that object. Eventually they might win
and hit 0 so that the expiredCheck returns true, but again it's random and
usually happens only after a long period of time. It's very easy to reproduce
if you only have one connection in your pool.
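To make the shape of that problem concrete, here's a simplified model (mine,
not the actual ConnectionPool source; the names only approximate the real
ones) of the check the threads keep losing:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the behaviour described above (NOT the real
// ConnectionPool source, just the shape of it): an expired pool can only be
// torn down once its reference count reaches zero, so concurrent borrowers
// keep it alive and it keeps being handed back out.
class ConnectionPoolModel {

    private final AtomicInteger referenceCount = new AtomicInteger(0);
    private volatile boolean expired = false; // set when the expiry/idle timeout trips

    void incrementReferenceCount() {
        referenceCount.incrementAndGet();
    }

    void decrementReferenceCount() {
        referenceCount.decrementAndGet();
    }

    // Roughly what expiredCheck() has to decide: destroy the pool only when
    // nobody holds a reference. Under load the count rarely sits at zero, so
    // the expired pool survives and createConnection() keeps returning it.
    boolean expiredCheck() {
        if (expired && referenceCount.get() == 0) {
            close();
            return true;
        }
        return false;
    }

    private void close() {
        // tear down the underlying ActiveMQ connection and evict it from the cache
    }
}
{code}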
The fix that I originally implemented in my own code was to add an
"isExpiring" flag to the ConnectionPool and continue to throw the bad
connection back into the pool, in the hope that it would be invalidated as the
while loop grabbed another connection (to keep things as close to the original
behavior as possible). That fixed the startup-with-the-broker-down problem,
but it introduced another problem if failover occurs after good connection(s)
are established. Basically, I learned that the PooledConnectionFactory never
validated/started/established a working connection (it has hooks for a
connection error to set it as expiring, but the startup and validation happen
later in the workflow). That led to the DefaultJmsMessageContainer continually
flagging the connection as bad and continually trying to refresh it, which
basically ends up with the system creating a single connection with one
session and terminating it after it's used, over and over again. Not ideal,
since we're looking to pool.
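For reference, the flag change was roughly along these lines (a sketch of the
idea against the simplified model above, not my actual diff):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch of the "isExpiring" idea (not the real ConnectionPool code or
// my actual patch): a connection error flags the pool, and expiredCheck()
// honours that flag like a normal expiry, so the borrow loop in
// createConnection() can evict the bad entry and move on.
class ConnectionPoolWithExpiringFlag {

    private final AtomicInteger referenceCount = new AtomicInteger(0);
    private volatile boolean expired = false;
    private volatile boolean isExpiring = false;

    void onConnectionError() {
        isExpiring = true; // wired into the existing connection-error hook
    }

    boolean expiredCheck() {
        // a pool that is expiring is treated the same as one that has expired
        if ((expired || isExpiring) && referenceCount.get() == 0) {
            // destroy the underlying connection and remove it from the cache
            return true;
        }
        return false;
    }
}
{code}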
So I added validation to PooledConnectionFactory's cache itself, to start the
connection when a ConnectionPool is newly created and make sure it actually
works. That worked great, except that createConnection is synchronized and has
a while loop that, under certain failover conditions, ends up blocking all
connection threads indefinitely. Introducing a max limit to the loop (or a
similar base case) addresses that, but it's not ideal either, since you'd
assume everything in the cache is valid.
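The same validate-and-bound-the-retries idea can also be sketched from the
caller's side (this is just an illustration of the approach, not the change I
made inside the factory; the class and method names here are made up):
{code:java}
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;

// Caller-side illustration of the idea: borrow a connection, start it to
// prove the transport is actually up, and give up after a bounded number of
// attempts instead of looping forever.
class ValidatingBorrower {

    static Connection borrowStarted(ConnectionFactory pool, int maxAttempts)
            throws JMSException {
        JMSException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Connection c = pool.createConnection();
            try {
                c.start(); // fails here if the underlying connection is dead
                return c;
            } catch (JMSException e) {
                last = e;
                try {
                    c.close(); // hand the bad connection back so it can be expired
                } catch (JMSException ignored) {
                    // already broken, nothing more we can do with it
                }
            }
        }
        throw last != null ? last : new JMSException("no usable pooled connection");
    }
}
{code}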
In general, what was supposed to be a short "look around" to fix a race
condition ended up with me redoing quite a bit of functionality and
redesigning how a "close" and cache destroy occur on a ConnectionPool. I think
a lot of the current design was an attempt to make the most of the features of
the cache implementation that's in place. I can pull my code modifications off
my system and attach them to this ticket if you're interested; otherwise I
just wanted to bring some of these issues to your attention.
Hopefully this made sense; otherwise I can clarify further.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)