Jeremy Lucier created AMQ-4954:
----------------------------------
Summary: PooledConnectionFactory Race Condition & Failover Problems
Key: AMQ-4954
URL: https://issues.apache.org/jira/browse/AMQ-4954
Project: ActiveMQ
Issue Type: Bug
Components: JMS client
Affects Versions: 5.9.0
Environment: Linux 64bit, 8-core, 16gb RAM.
Reporter: Jeremy Lucier
[Sorry for this long bug report]
We're currently using ActiveMQ 5.9 via a Camel implementation for
consuming/polling messages off of queues as well as sending messages. With the
latest version of ActiveMQ on both the client and server side (5.9), we've
discovered numerous problems with the changes to PooledConnectionFactory.
These problems may have existed before, but it wasn't until recently that we
were able to do any basic load testing on our system.
[1] The first one is a race condition, which is actually documented within the
code itself via a comment (the createConnection method in
PooledConnectionFactory). If we start our Camel-based application while the
broker is down, and then bring the broker up, it takes an indeterminate
amount of time to reconnect (typically a few hours of thrashing). The reason
is that on startup or failover, all the routes/threads attempt to get a
connection and establish a new session via PooledConnectionFactory in unison,
over and over again. That's fine and expected behavior, but it leads us to
the next problem, which contributes to the race condition.
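To illustrate the kind of load pattern I mean, here's a rough sketch (not our
actual Camel setup; the broker URL, thread count, and retry loop are just
placeholders) of many consumers borrowing from one PooledConnectionFactory
while the broker is still down:
{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;

public class StartupThrashSketch {

    public static void main(String[] args) {
        // Broker is assumed to be down when this starts and brought up later.
        // startupMaxReconnectAttempts makes the failover transport surface
        // failures instead of blocking forever inside start().
        ActiveMQConnectionFactory amq = new ActiveMQConnectionFactory(
                "failover:(tcp://localhost:61616)?startupMaxReconnectAttempts=3");

        final PooledConnectionFactory pool = new PooledConnectionFactory(amq);
        pool.setMaxConnections(1); // one pooled connection makes the race easy to hit

        // Stand-in for many routes/threads all trying to borrow a connection
        // at the same time, retrying immediately on failure.
        for (int i = 0; i < 20; i++) {
            new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try {
                            Connection c = pool.createConnection();
                            c.start();
                            return; // this "route" finally has a usable connection
                        } catch (JMSException e) {
                            // every thread retries in unison, hammering
                            // createConnection() over and over
                        }
                    }
                }
            }).start();
        }
    }
}
{code}
In our real system the threads doing this are the Camel routes, which is why
the thrashing can go on for hours.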
[2] The PooledConnectionFactory continues to return expired/expiring
ConnectionPool instances, and the calling threads fight to increment/decrement
the internal reference counter within that object. Eventually they might win
and hit 0 so that the expiredCheck returns true, but again it's random and
usually happens only after a long period of time. It's very easy to reproduce
if you only have one connection in your pool.
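To make the shape of that problem concrete, here's a simplified model (mine,
not the actual ConnectionPool source; the names only approximate the real
ones) of the check the threads keep losing:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the behaviour described above (NOT the real
// ConnectionPool source, just the shape of it): an expired pool can only be
// torn down once its reference count reaches zero, so concurrent borrowers
// keep it alive and it keeps being handed back out.
class ConnectionPoolModel {

    private final AtomicInteger referenceCount = new AtomicInteger(0);
    private volatile boolean expired = false; // set when the expiry/idle timeout trips

    void incrementReferenceCount() {
        referenceCount.incrementAndGet();
    }

    void decrementReferenceCount() {
        referenceCount.decrementAndGet();
    }

    // Roughly what expiredCheck() has to decide: destroy the pool only when
    // nobody holds a reference. Under load the count rarely sits at zero, so
    // the expired pool survives and createConnection() keeps returning it.
    boolean expiredCheck() {
        if (expired && referenceCount.get() == 0) {
            close();
            return true;
        }
        return false;
    }

    private void close() {
        // tear down the underlying ActiveMQ connection and evict it from the cache
    }
}
{code}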
The fix that I originally implemented in my own code was to add an
"isExpiring" flag to the ConnectionPool and continue to throw the bad
connection back into the pool, in the hope that it would be invalidated as the
while loop grabbed another connection (to keep things as close to the original
behavior as possible). That fixed the startup-with-the-broker-down problem,
but it introduced another problem if failover occurs after good connection(s)
are established. Basically, I learned that the PooledConnectionFactory never
validated/started/established a working connection (it has hooks for a
connection error to set it as expiring, but the startup and validation happen
later in the workflow). That led to the DefaultJmsMessageContainer continually
flagging the connection as bad and continually trying to refresh it, which
basically ends up with the system creating a single connection with one
session and terminating it after it's used, over and over again. Not ideal,
since we're looking to pool.
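For reference, the flag change was roughly along these lines (a sketch of the
idea against the simplified model above, not my actual diff):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch of the "isExpiring" idea (not the real ConnectionPool code or
// my actual patch): a connection error flags the pool, and expiredCheck()
// honours that flag like a normal expiry, so the borrow loop in
// createConnection() can evict the bad entry and move on.
class ConnectionPoolWithExpiringFlag {

    private final AtomicInteger referenceCount = new AtomicInteger(0);
    private volatile boolean expired = false;
    private volatile boolean isExpiring = false;

    void onConnectionError() {
        isExpiring = true; // wired into the existing connection-error hook
    }

    boolean expiredCheck() {
        // a pool that is expiring is treated the same as one that has expired
        if ((expired || isExpiring) && referenceCount.get() == 0) {
            // destroy the underlying connection and remove it from the cache
            return true;
        }
        return false;
    }
}
{code}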
So I added validation to PooledConnectionFactory's cache itself, to start the
connection when a ConnectionPool is newly created and make sure it actually
works. That worked great, except that createConnection is synchronized and has
a while loop that, under certain failover conditions, ends up blocking all
connection threads indefinitely. Introducing a max limit to the loop (or a
similar base case) addresses that, but it's not ideal either, since you'd
assume everything in the cache is valid.
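The same validate-and-bound-the-retries idea can also be sketched from the
caller's side (this is just an illustration of the approach, not the change I
made inside the factory; the class and method names here are made up):
{code:java}
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;

// Caller-side illustration of the idea: borrow a connection, start it to
// prove the transport is actually up, and give up after a bounded number of
// attempts instead of looping forever.
class ValidatingBorrower {

    static Connection borrowStarted(ConnectionFactory pool, int maxAttempts)
            throws JMSException {
        JMSException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Connection c = pool.createConnection();
            try {
                c.start(); // fails here if the underlying connection is dead
                return c;
            } catch (JMSException e) {
                last = e;
                try {
                    c.close(); // hand the bad connection back so it can be expired
                } catch (JMSException ignored) {
                    // already broken, nothing more we can do with it
                }
            }
        }
        throw last != null ? last : new JMSException("no usable pooled connection");
    }
}
{code}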
In general, what was supposed to be a short "look around" to fix a race
condition ended up with me redoing quite a bit of functionality and
redesigning how a "close" and cache destroy occur on a ConnectionPool. I think
a lot of the current design was an attempt to make the most of the features of
the cache implementation that's in place. I can pull my code modifications off
my system and attach them to this ticket if you're interested; otherwise I
just wanted to bring some of these issues to your attention.
Hopefully this made sense; otherwise I can clarify further.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)