I already have the Request ID Block Size set to 50 on this form and on the form it pushes to. I've seen transaction locks on insert/update statements before, but not on a select.
From: Action Request System discussion list (ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Grooms, Frederick W
Sent: Thursday, August 15, 2013 1:46 PM
To: arslist@ARSLIST.ORG
Subject: Re: User locks in db causing stuck threads that aren't timed out

** We are running a similar setup (pure custom, multiple servers, 7.6.04 on RHEL with Oracle, that is), and I received an email from the DBA about a deadlock issue in the database. The email contained:

*** 2013-07-31 16:24:14.028
*** SERVICE NAME:(SYS$USERS) 2013-07-31 16:24:14.027
*** SESSION ID:(1363.46981) 2013-07-31 16:24:14.027
DEADLOCK DETECTED ( ORA-00060 )
...
Information on the OTHER waiting sessions:
Session 1354:
  pid=126 serial=2071 audsid=680033137 user: 98/USER
  O/S info: user: loginname, term: , ospid: 3264, machine: appserver
  program: arserverd@appserver (TNS V1-
  application name: arserverd@appserver (TNS V1-, hash value=0
  Current SQL Statement:
  UPDATE arschema SET nextId = nextId + :"SYS_B_0" WHERE schemaId = :"SYS_B_1"
End of information on OTHER waiting sessions.
Current SQL statement for this session:
UPDATE arschema SET nextId = nextId + :"SYS_B_0" WHERE schemaId = :"SYS_B_1"

I searched the SQL logs (we have them turned on at all times) and could not find any ORA- errors in them. This led me to believe that the Oracle deadlock errors are not reported back to the application. The DBA was able to see the ORA-00060 errors in his alert log, and he emailed me the trace file for the error.

Since I could see that it was an arschema nextId update, I knew that it was the system doing a lookup to find the next entry ID for inserting a record into a form. 7.6 (I don't remember exactly which version implemented the feature) has the ability to have a server pre-allocate a set number of entry IDs so the server doesn't have to go back to the arschema table as often. Look in the Admin console -> Server Information -> Configuration tab for "Next Request ID Block Size".
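[Editor's sketch] Before touching that setting, it helps to know which forms are actually hammering the arschema row. A minimal shell sketch of mining the SQL logs for arschema updates per schemaId follows; the sample log file and its line format are invented for illustration (real arsql.log lines carry timestamps and more context), so point the grep at your actual logs instead.

```shell
#!/bin/sh
# Count arschema next-ID updates per schemaId to find the busiest forms.
# The sample log content below is fabricated for illustration only;
# replace arsql_sample.log with your real arsql.log.
cat > arsql_sample.log <<'EOF'
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 387
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
EOF

# Busiest forms first: count of arschema updates per schemaId
grep "UPDATE arschema" arsql_sample.log \
  | grep -o "schemaId = [0-9]*" \
  | sort | uniq -c | sort -rn
```

Forms that top this list are the candidates for a per-form Next Request ID Block Size, since each block of N pre-allocated IDs replaces N single-increment updates against the same arschema row.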
If you don't want to set the block size for the entire server, you can set it per form. To find out which form(s) were having problems, I did the following on each application server:

grep "UPDATE arschema" arsql.log > mylog_servername.log

I then copied the grep output files into a single place and combined them to analyze how many requests were made for each form ("schemaId =") in the timeframe of the SQL log(s). In a 5-minute timeframe I discovered I had one form with more than 600 records created in it. On child forms with lots of records created, I upped the block size to something like 40.

The downside of using Block Size is that records may no longer be in sequential order, which is why I couldn't just set it for the entire server and be done with it. Setting it for child forms with a large number of inserts (like a history or audit-trail type of form) eliminated the deadlocks for us. This is set per form on the Form Properties -> Basic tab in Developer Studio.

Another way to reduce overhead is to turn off Status History for forms that do not need it. This is also on the Form Properties for each form.

Fred

From: Action Request System discussion list (ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Campbell, Paul (Paul)
Sent: Thursday, August 15, 2013 10:08 AM
To: arslist@ARSLIST.ORG
Subject: User locks in db causing stuck threads that aren't timed out

** OK, put on your thinking caps for this one, it's a good one. I had a situation where our database was experiencing user locks for a specific SQL transaction that was causing fast threads to get hung in a waiting state, some for days. So, for the background:

Our environment: We are running four RHEL 6 servers with 4 quad-core CPUs and 96 GB of RAM each. Two run ARS version 7.6.04 SP4 in a server group with an Oracle 10g database; two run WebLogic 10g for Mid-Tier and the JSS XMLGateway.
We are a fully custom shop: no OOB, virtually no user interaction, a backend alarm-processing system using mostly web services and JMS for communications in and out. We have alarm receivers that use JMS queues polled by XMLGateway, which makes API calls to create a record in our alarm transaction form; the same process then creates an Incident, which gets sent to the user ticketing system via an outbound web service call. During the create of the Incident, we do a Set Fields from a view form that points to a table in a remote Oracle DB that we access via a database link. This remote DB lookup uses the primary key/unique index in the WHERE clause, so it is very fast, usually sub-100ms, even over the DB link. We have been running with a max of 50 fast threads and a max of 75 list threads. (I know, a convoluted process; I could spend weeks explaining the history.)

What happened: About a week and a half ago we noticed an issue where we were getting timeouts trying to perform inserts and updates via our alarm receivers' API app, and about this time we were also getting stuck threads in our Mid-Tier servers. I started looking: we were seeing growth in the number of fast threads in use, and we were beginning to see the number of blocked items in the fast queue growing, so we restarted the app servers and all was good. After 2 days we were out of threads again, so I upped the max to 100 fast threads. Three to four days later, the same issue: all 100 were used. This was a Sunday and I was not at home, so we set the max fast threads to 200 until I could get back on Monday. We initially thought it was caused by an increase in incoming alarms, which had doubled the first of August, but on Tuesday we got a call from our DBA: we had over 1000 sessions in the database that were experiencing a user lock, and all 1000 had the exact same SQL statement.
We started investigating, and this statement was the lookup from the remote database table where we get 3rd-party vendor information for the site a device is associated with. As I said, all 1000 were for the exact same site. I started looking through SQL and filter logs and determined that when the filter that runs the Set Fields fired, the thread never logged any other actions beyond the SQL statement. It just hung: no timeout, no error, just hung, never to respond again. As we were clearing these sessions, we noticed that some were for Remedy process IDs that were no longer running. After clearing about 400, we stumbled on the one that had seemingly locked all the subsequent sessions querying that record, because about 300 inserts and updates for alarms related to that site completed in about 30 seconds.

So my questions are:
1. Has anyone ever experienced this type of issue, where a database session just hung and neither Oracle nor Remedy generated a timeout?
2. Does anyone have any idea of a way to detect these before it gets real bad, other than looking for steady thread growth?

Paul Campbell | Development Team Lead | TS&D SSBL, A2R WFE, and ESP Remedy Team | Avaya Client Services | 1145 Sanctuary Parkway, Lake View II Suite 110, Alpharetta, GA 30009 | 678-421-5342

"Everyone needs deadlines. Even the beavers. They loaf around all summer, but when they are faced with the winter deadline, they work like fury. If we didn't have deadlines, we'd stagnate." -- Walt Disney

_ARSlist: "Where the Answers Are" and have been for 20 years_
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org "Where the Answers Are, and have been for 20 years"
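[Editor's sketch] On Paul's second question, detecting blocked sessions before the thread pool drains: Oracle's V$SESSION view has carried a BLOCKING_SESSION column since 10g, so a periodic query for long-waiting blocked sessions is one hedged option. The connection string, wait threshold, and alerting command below are assumptions to adapt, not anything from the thread.

```shell
#!/bin/sh
# Generate a SQL*Plus script that lists sessions stuck behind a blocker.
# Threshold (60 s) and the cron/mail wiring are illustrative assumptions.
cat > blocked_sessions.sql <<'EOF'
SET PAGESIZE 100 LINESIZE 200
-- Sessions waiting on another session for more than 60 seconds
SELECT sid, serial#, blocking_session, seconds_in_wait, sql_id
  FROM v$session
 WHERE blocking_session IS NOT NULL
   AND seconds_in_wait > 60
 ORDER BY seconds_in_wait DESC;
EOF

# Run from cron every few minutes and alert if rows come back, e.g.:
#   sqlplus -s "$ORACLE_CONN" @blocked_sessions.sql | mail -s "Blocked sessions" dba@example.com
echo "Wrote blocked_sessions.sql"
```

Catching a session with a large and growing SECONDS_IN_WAIT would have flagged the single hung lookup here days before all 1000 sessions queued up behind it.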