I already have the Request ID Block Size set to 50 on this form and on the form it pushes to. I've seen transaction locks on insert/update statements before, but not on a select.
From: Action Request System discussion list (ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Grooms, Frederick W
Sent: Thursday, August 15, 2013 1:46 PM
To: arslist@ARSLIST.ORG
Subject: Re: User locks in db causing stuck threads that aren't timed out

** We are running a similar setup (pure custom, multiple servers, 7.6.04 on RHEL with Oracle, that is), and I received an email from the DBA about a deadlock issue in the database. The email contained:

*** 2013-07-31 16:24:14.028
*** SERVICE NAME:(SYS$USERS) 2013-07-31 16:24:14.027
*** SESSION ID:(1363.46981) 2013-07-31 16:24:14.027
DEADLOCK DETECTED ( ORA-00060 )
...
Information on the OTHER waiting sessions:
Session 1354:
  pid=126 serial=2071 audsid=680033137 user: 98/USER
  O/S info: user: loginname, term: , ospid: 3264, machine: appserver
  program: arserverd@appserver (TNS V1-
  application name: arserverd@appserver (TNS V1-, hash value=0
  Current SQL Statement:
  UPDATE arschema SET nextId = nextId + :"SYS_B_0" WHERE schemaId = :"SYS_B_1"
End of information on OTHER waiting sessions.
Current SQL statement for this session:
UPDATE arschema SET nextId = nextId + :"SYS_B_0" WHERE schemaId = :"SYS_B_1"

I searched the SQL logs (we have them turned on at all times) and could not find any ORA- errors in them. This led me to believe that the Oracle deadlock errors are not reported back to the application. The DBA was able to see the ORA-00060 errors in his alert log, and he emailed me the trace file for the error.

Since I could see that it was an arschema nextId update, I knew that it was the system doing a lookup to find the next entry ID for inserting a record into a form. 7.6 (I don't remember exactly which version implemented the feature) has the ability to have a server pre-allocate a set number of entry IDs so the server doesn't have to go back to the arschema table as often. Look in the Admin console -> Server Information -> Configuration tab for "Next Request ID Block Size".
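[Editor's sketch] Before touching that setting, it helps to know which forms are actually hammering the arschema row. A minimal shell sketch of mining the SQL logs for arschema updates per schemaId follows; the sample log file and its line format are invented for illustration (real arsql.log lines carry timestamps and more context), so point the grep at your actual logs instead.

```shell
#!/bin/sh
# Count arschema next-ID updates per schemaId to find the busiest forms.
# The sample log content below is fabricated for illustration only;
# replace arsql_sample.log with your real arsql.log.
cat > arsql_sample.log <<'EOF'
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 387
UPDATE arschema SET nextId = nextId + 1 WHERE schemaId = 421
EOF

# Busiest forms first: count of arschema updates per schemaId
grep "UPDATE arschema" arsql_sample.log \
  | grep -o "schemaId = [0-9]*" \
  | sort | uniq -c | sort -rn
```

Forms that top this list are the candidates for a per-form Next Request ID Block Size, since each block of N pre-allocated IDs replaces N single-increment updates against the same arschema row.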
If you don't want to set the block size for the entire server, you can set it per form. To find out which form(s) were having problems, I did the following on each application server:

grep "UPDATE arschema" arsql.log > mylog_servername.log

I then copied the grep output files into a single place and combined them to analyze how many requests were made for each form ("schemaId =") in the timeframe of the SQL log(s). In a 5-minute timeframe I discovered I had one form with more than 600 records created in it. On child forms with lots of records created, I upped the block size to something like 40.

The downside of using Block Size is that records may no longer be in sequential order, which is why I couldn't just set it for the entire server and be done with it. Setting it for child forms with a large number of inserts (like a history or audit-trail type of form) eliminated the deadlocks for us. This is set per form on the Form Properties -> Basic tab in Developer Studio.

Another way to reduce overhead is to turn off Status History for forms that do not need it. This is also on the Form Properties for each form.

Fred

From: Action Request System discussion list (ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Campbell, Paul (Paul)
Sent: Thursday, August 15, 2013 10:08 AM
To: arslist@ARSLIST.ORG
Subject: User locks in db causing stuck threads that aren't timed out

** OK, put on your thinking caps for this one, it's a good one. I had a situation where our database was experiencing user locks for a specific SQL transaction that was causing fast threads to get hung in a waiting state, some for days. So, for the background:

Our environment: We are running four RHEL 6 servers with 4 quad-core CPUs and 96 GB of RAM each. Two run ARS version 7.6.04 SP4 in a server group with an Oracle 10g database; two run WebLogic 10g for Mid-Tier and the JSS XMLGateway.
We are a fully custom shop: no OOB, virtually no user interaction, a backend alarm-processing system using mostly web services and JMS for communications in and out. We have alarm receivers that use JMS queues polled by XMLGateway, which makes API calls to create a record in our alarm transaction form; the same process then creates an Incident, which gets sent to the user ticketing system via an outbound web service call. During the create of the Incident, we do a Set Fields from a view form that points to a table in a remote Oracle DB that we access via a database link. This remote DB lookup uses the primary key/unique index in the WHERE clause, so it is very fast, usually sub-100ms, even over the DB link. We have been running with a max of 50 fast threads and a max of 75 list threads. (I know, a convoluted process; I could spend weeks explaining the history.)

What happened: About a week and a half ago we noticed an issue where we were getting timeouts trying to perform inserts and updates via our alarm receivers' API app, and about this time we were also getting stuck threads in our Mid-Tier servers. I started looking: we were seeing growth in the number of fast threads in use, and we were beginning to see the number of blocked items in the fast queue growing, so we restarted the app servers and all was good. After 2 days we were out of threads again, so I upped the max to 100 fast threads. Three to four days later, the same issue: all 100 were used. This was a Sunday and I was not at home, so we set the max fast threads to 200 until I could get back on Monday. We initially thought it was caused by an increase in incoming alarms, which had doubled the first of August, but on Tuesday we got a call from our DBA: we had over 1000 sessions in the database that were experiencing a user lock, and all 1000 had the exact same SQL statement.
We started investigating, and this statement was the lookup from the remote database table where we get 3rd-party vendor information for the site a device is associated with. As I said, all 1000 were for the exact same site. I started looking through SQL and filter logs and determined that when the filter that runs the Set Fields fired, the thread never logged any other actions beyond the SQL statement. It just hung: no timeout, no error, just hung, never to respond again. As we were clearing these sessions, we noticed that some were for Remedy process IDs that were no longer running. After clearing about 400, we stumbled on the one that had seemingly locked all the subsequent sessions querying that record, because about 300 inserts and updates for alarms related to that site completed in about 30 seconds.

So my questions are:
1. Has anyone ever experienced this type of issue, where a database session just hung and neither Oracle nor Remedy generated a timeout?
2. Does anyone have any idea of a way to detect these before it gets real bad, other than looking for steady thread growth?

Paul Campbell | Development Team Lead | TS&D SSBL, A2R WFE, and ESP Remedy Team | Avaya Client Services | 1145 Sanctuary Parkway, Lake View II Suite 110, Alpharetta, GA 30009 | 678-421-5342

"Everyone needs deadlines. Even the beavers. They loaf around all summer, but when they are faced with the winter deadline, they work like fury. If we didn't have deadlines, we'd stagnate." -- Walt Disney

_ARSlist: "Where the Answers Are" and have been for 20 years_
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org "Where the Answers Are, and have been for 20 years"
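[Editor's sketch] On Paul's second question, detecting blocked sessions before the thread pool drains: Oracle's V$SESSION view has carried a BLOCKING_SESSION column since 10g, so a periodic query for long-waiting blocked sessions is one hedged option. The connection string, wait threshold, and alerting command below are assumptions to adapt, not anything from the thread.

```shell
#!/bin/sh
# Generate a SQL*Plus script that lists sessions stuck behind a blocker.
# Threshold (60 s) and the cron/mail wiring are illustrative assumptions.
cat > blocked_sessions.sql <<'EOF'
SET PAGESIZE 100 LINESIZE 200
-- Sessions waiting on another session for more than 60 seconds
SELECT sid, serial#, blocking_session, seconds_in_wait, sql_id
  FROM v$session
 WHERE blocking_session IS NOT NULL
   AND seconds_in_wait > 60
 ORDER BY seconds_in_wait DESC;
EOF

# Run from cron every few minutes and alert if rows come back, e.g.:
#   sqlplus -s "$ORACLE_CONN" @blocked_sessions.sql | mail -s "Blocked sessions" dba@example.com
echo "Wrote blocked_sessions.sql"
```

Catching a session with a large and growing SECONDS_IN_WAIT would have flagged the single hung lookup here days before all 1000 sessions queued up behind it.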