Oracle 8.1.5.1

Solaris 2.6

 

Hi there,

 

For the last few days one of our servers will suddenly experience ridiculously high load averages (50-65) and all applications grind to a halt.

 

Restarting the database solves the problem and shutting down a specific apps servers solves the problem.  This apps server generally has about 2-3 JDBC connections to the database.  I just started looking at it today (it's not one of my databases) and when it recurred I started checking wait events.

 

The top 10 wait events for about a 2 minute period were as follows:

 

EVENT                                    Delta Waits      Delta Timeouts

latch free                                  317376          185591             

SQL*Net message from client              66435           0            

SQL*Net message to client                  66419           0            

db file sequential read               21709           0            

SQL*Net more data to client                18066           0            

file open                                   1223                0         

rdbms ipc message                   576     170    

log file sync                               519     1         

log file parallel write                  374     0         

log file sequential read               343     0         

 

 

It looks like a latch contention problem.  By the time I drilled down to specific sessions that were experiencing the "latch free" wait the apps server had been restarted.  My query showed most waiting on the "process allocation" latch but I don't know for sure that this was the latch that was causing the problem during the high load period.

 

I prepared this query to run the next time the problem occurs in order to identify which latches are being waited on:

 

select s.sid,s.username,s.program,s.status,

se.event,se.total_waits,se.total_timeouts,

se.time_waited,se.average_wait,sw.p2,v.name

from v$session s, v$session_event se, v$session_wait sw,v$latch v

 where s.sid=se.sid

 and s.sid=sw.sid

 and sw.p2=l.latch#

 and s.status='ACTIVE'

 AND S.USERNAME IS NOT NULL

 and se.event='latch free';

 

Does this make sense?  Any suggestions for other things to check?  Some reading suggests that latch contention is often caused by not using bind variables and that increasing db_block_lru_latches (currently = 2) can alleviate the problem while the code is being fixed.

 

But the question is why this problem suddenly happens and the problem doesn't go away until the apps server is shut down.  The old version of the application (which didn't use JDBC) did not have this problem.  I'm theorizing that perhaps JDBC handles its connections differently and after waiting for some period of time resends the transaction and keeps resending it faster than Oracle can clean things up?  

 

Has anyone experienced a similar problem or have other suggestions of where to go from here?

 

 

Thanks,

Jay Miller

Reply via email to