OK, put on your thinking caps for this one, it's a good one. We had a situation where a user lock in the database on a specific SQL transaction was causing fast threads to get hung in a waiting state, some for days. First, the background:
Our environment: We are running four RHEL 6 servers, each with 4 quad-core CPUs and 96GB of RAM. Two run ARS version 7.6.04 SP4 in a server group against an Oracle 10g database; the other two run WebLogic 10g for Mid-Tier and the JSS XMLGateway. We are a fully custom shop, no OOB and virtually no user interaction: a backend alarm processing system that uses mostly web services and JMS for communications in and out. Our alarm receivers publish to JMS queues that are polled by XMLGateway, which makes API calls to create a record in our alarm transaction form. That same process creates an Incident, which gets sent to the user ticketing system via an outbound web service call. During the create of the Incident, we do a set fields from a view form that points to a table in a remote Oracle DB that we access via a database link. This remote DB lookup is done using the primary key/unique index in the where clause, so it is very fast, usually sub-100ms, even over the DB link. We had been running with a max of 50 fast threads and a max of 75 list threads. (I know, a convoluted process; I could spend weeks explaining the history.)

What happened: About a week and a half ago we noticed an issue where we were getting timeouts trying to perform inserts and updates via our alarm receivers' API app, and about this time we were also getting stuck threads in our Mid-Tier servers. I started looking and saw growth in the number of fast threads in use, and the number of blocked items in the fast queue was beginning to grow as well. We restarted the app servers and all was good; after 2 days we were out of threads again, so I upped the max to 100 fast threads. Three to four days later, same issue, all 100 were in use. This was a Sunday and I was not at home, so we set the max fast threads to 200 until I could get back on Monday.
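The early symptom was thread counts that climbed and never came back down. As a rough sketch of how you might flag that automatically, here is a small Python detector, assuming you can sample the in-use fast-thread count periodically from whatever source you have (server statistics, log scraping; the sampling source itself is a placeholder here, and the window/threshold values are made up):

```python
from collections import deque

def make_growth_detector(window=12, min_growth=20):
    """Return a sampler that takes periodic readings of the in-use
    fast-thread count and flags sustained growth: the count never
    drops back within the window AND rises by at least min_growth."""
    samples = deque(maxlen=window)

    def sample(in_use):
        samples.append(in_use)
        if len(samples) < window:
            return False  # not enough history yet
        history = list(samples)
        never_recovers = all(b >= a for a, b in zip(history, history[1:]))
        return never_recovers and history[-1] - history[0] >= min_growth

    return sample
```

A normal workload bounces up and down, so the "never recovers" condition stays false; threads stuck forever behind a blocked DB session only accumulate, which is exactly the pattern we saw before each restart.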
We initially thought it was caused by an increase in incoming alarms, which had doubled at the first of August. But on Tuesday we got a call from our DBA: there were over 1000 sessions in the database experiencing a user lock, and all 1000 had the exact same SQL statement. We started investigating, and the statement was the lookup from the remote database table where we get 3rd-party vendor information for the site a device is associated with. As I said, all 1000 were for the exact same site. Going through the SQL and filter logs, I determined that when the filter that does the set fields ran, the thread never logged any other action beyond the SQL statement. It just hung: no timeout, no error, never to respond again. As we were clearing these sessions, we noticed that some were for Remedy process IDs that were no longer running. After clearing about 400, we stumbled on the one that had seemingly locked all the subsequent sessions querying that record, because about 300 inserts and updates for alarms related to that site completed in something like 30 seconds.

So my questions are:

1. Has anyone ever experienced this type of issue, where a database session just hung and neither Oracle nor Remedy generated a timeout?
2. Does anyone have any idea of a way to detect these before it gets real bad, other than looking for steady thread growth?

Paul Campbell | Development Team Lead | TS&D SSBL, A2R WFE, and ESP Remedy Team | Avaya Client Services | 1145 Sanctuary Parkway, Lake View II, Suite 110, Alpharetta, GA 30009 | 678-421-5342

"Everyone needs deadlines. Even the beavers. They loaf around all summer, but when they are faced with the winter deadline, they work like fury. If we didn't have deadlines, we'd stagnate." - Walt Disney

_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
"Where the Answers Are, and have been for 20 years"
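On question 2, one approach worth sketching: instead of clearing sessions one by one hoping to hit the root blocker, walk the blocking chains and find the session(s) that block others but are not themselves blocked. In Oracle 10g, v$session exposes a BLOCKING_SESSION column, so something like `SELECT sid, blocking_session FROM v$session` should give the raw data (verify the column names against your version). The chain-walking itself is just graph traversal; here is a minimal, hedged sketch in Python over that data, with the SIDs below invented for illustration:

```python
def find_root_blockers(blocking):
    """blocking: dict of sid -> blocking sid (None if the session is
    not blocked), e.g. built from v$session SID / BLOCKING_SESSION.
    Returns {root_sid: count of sessions transitively waiting on it}."""
    def root_of(sid):
        seen = set()
        # Follow the chain of blockers; stop at an unblocked session
        # or if we revisit a sid (cycle, i.e. a deadlock).
        while blocking.get(sid) is not None and sid not in seen:
            seen.add(sid)
            sid = blocking[sid]
        return sid

    waiters = {}
    for sid, blocker in blocking.items():
        if blocker is not None:
            root = root_of(sid)
            waiters[root] = waiters.get(root, 0) + 1
    return waiters

# Hypothetical snapshot: 102, 103, 104 all ultimately wait on 101.
snapshot = {101: None, 102: 101, 103: 102, 104: 102, 200: None}
```

A session showing up as the root of hundreds of waiters is the one to investigate (or kill) first, which in our case would have saved clearing 400 sessions by hand. The stale sessions we saw for dead Remedy process IDs also suggest looking at dead connection detection (SQLNET.EXPIRE_TIME in sqlnet.ora) on the DB side, though that is a separate knob from detecting the blocking chain.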