OK, put on your thinking caps for this one, it's a good one

I had a situation where our database was experiencing user locks on a specific 
SQL transaction, which was causing fast threads to get hung in a waiting state, 
some for days. So, for the background:

Our environment:
We are running four RHEL 6 servers with 4 quad-core CPUs and 96GB of RAM. Two 
are running ARS version 7.6.04 SP4 in a server group with an Oracle 10g 
database, and two are running WebLogic 10g for Mid-Tier and the JSS XMLGateway. 
We are a fully custom shop: no OOB, virtually no user interaction, a backend 
alarm processing system using mostly web services and JMS for communications in 
and out. We have alarm receivers that feed JMS queues, which are polled by 
XMLGateway; it makes API calls to create a record in our alarm transaction 
form, which in the same process creates an Incident, which gets sent to the 
user ticketing system via an outbound web service call. During the create of 
the incident, we do a set fields from a view form that points to a table in a 
remote Oracle DB that we access via a database link. This remote DB lookup is 
done using the primary key/unique index in the where clause (roughly the shape 
of the query sketched below) and is very fast, usually sub-100ms, even over the 
DB link. We have been running with a max of 50 fast threads and a max of 75 
list threads.
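
For reference, the generated lookup is essentially this shape. The link, table, 
and column names below are invented for illustration; the real ones come from 
the view form and the filter's qualification:

  -- Illustrative only: object and column names are made up; the real SQL is
  -- generated by the set fields filter against the view form.
  SELECT vendor_name, vendor_contact, vendor_phone
    FROM site_vendor_info@remote_site_db   -- remote table reached over the DB link
   WHERE site_id = :site_id;               -- primary key / unique index, normally sub-100ms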

(I know, a convoluted process, I could spend weeks explaining the history)

What happened:
About a week and a half ago we noticed an issue where we were getting timeouts 
trying to perform inserts and updates via our alarm receivers' API app, and 
around the same time we were also getting stuck threads in our Mid-Tier 
servers. I started looking and saw steady growth in the number of fast threads 
in use, and the number of blocked items in the fast queue was beginning to grow 
as well, so we restarted the app servers and all was good. After 2 days we were 
out of threads again, so I upped the max to 100 fast threads. Three or four 
days later, same issue, all 100 were in use. This was a Sunday and I was not at 
home, so we set the max fast threads to 200 until I could get back on Monday.

We initially thought it was caused by an increase in incoming alarms, which had 
doubled at the first of August, but on Tuesday we got a call from our DBA: we 
had over 1000 sessions in the database experiencing a user lock, and all 1000 
had the exact same SQL statement. We started investigating, and this statement 
was the lookup from the remote database table where we get 3rd party vendor 
information for the site a device is associated with. All 1000 were for the 
exact same site. I started looking through SQL and filter logs and determined 
that when the filter that does the set fields ran, the thread never logged any 
actions beyond the SQL statement. It just hung: no timeout, no error, just 
hung, never to respond again.

As we were clearing these sessions, we noticed that some were for Remedy 
process IDs that were no longer running. After clearing about 400, we stumbled 
on the one that had seemingly locked all the subsequent sessions querying that 
record, because about 300 inserts and updates for alarms related to that site 
completed in like 30 seconds.
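
For anyone curious, the pile-up is visible with plain v$session queries. 
Something along these lines (standard Oracle 10g columns, nothing specific to 
our setup) shows who is waiting and which session sits at the root of the 
chain, though whether blocking_session gets populated for a wait over a DB link 
is another question:

  -- Sketch: list the waiters and how long they have been stuck.
  SELECT sid, serial#, sql_id, event, seconds_in_wait, blocking_session
    FROM v$session
   WHERE blocking_session IS NOT NULL
   ORDER BY seconds_in_wait DESC;

  -- Sketch: find the session(s) at the root of the blocking chain.
  SELECT blocking_session AS root_blocker, COUNT(*) AS sessions_blocked
    FROM v$session
   WHERE blocking_session IS NOT NULL
   GROUP BY blocking_session
   ORDER BY sessions_blocked DESC;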

So my questions are:

1. Has anyone ever experienced this type of issue, where a database session 
just hung and neither Oracle nor Remedy generated a timeout?

2. Does anyone have any idea of a way to detect these before it gets real bad, 
other than looking for steady thread growth?
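
For what it's worth, the only crude check I can picture is polling for lots of 
active sessions stuck a long time on the same statement, something like the 
sketch below (v$session and v$sqlarea are standard views; the thresholds are 
pure guesses), but I am hoping someone has a better idea:

  -- Rough early-warning query: many ACTIVE sessions waiting a long time on the
  -- same SQL statement. Thresholds (300 seconds, 10 sessions) are arbitrary.
  SELECT s.sql_id,
         COUNT(*)               AS waiting_sessions,
         MAX(s.seconds_in_wait) AS longest_wait_sec,
         MAX(a.sql_text)        AS sample_sql
    FROM v$session s
    JOIN v$sqlarea a ON a.sql_id = s.sql_id
   WHERE s.status = 'ACTIVE'
     AND s.seconds_in_wait > 300
   GROUP BY s.sql_id
  HAVING COUNT(*) >= 10
   ORDER BY waiting_sessions DESC;

Run from a cron job every few minutes, alerting when it returns rows, maybe.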


Paul Campbell | Development Team Lead | TS&D SSBL, A2R WFE, and ESP Remedy Team | Avaya Client Services
1145 Sanctuary Parkway, Lake View II, Suite 110, Alpharetta, GA 30009 | 678-421-5342

Everyone needs deadlines. Even the beavers. They loaf around all summer, but 
when they are faced with the winter deadline, they work like fury. If we didn't 
have deadlines, we'd stagnate. - Walt Disney


