Hi everyone!

I've been working on a puzzling issue for a few days and am hoping that someone 
has seen something similar or can help.  There have been some odd behaviors on 
one of my production-facing postgres servers.


version info from postgres: PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, 
compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit


The symptom:   The database machine (running postgres 9.1.9 on CentOS 6.4) runs 
at low utilization most of the time, but once every day or two it appears to 
slow down to the point where queries back up and clients are unable to connect.  
When this event occurs there are lots of concurrent queries and I see slow 
queries appear in the logs, but I haven't been able to find anything abnormal 
that would explain the behavior.  The event lasts just long enough for 
monitoring to alarm.   We respond to the alerts and take a look, but within a 
minute or three at most, load returns to normal levels and all running queries 
complete in their expected times.


At the time of the event we see a spike in system CPU and load average, but we 
do not see a corresponding spike in disk reads or writes that would indicate IO 
load.   Initial troubleshooting with ps showed a flurry of postgres processes 
waiting on semtimedop.   To diagnose this, we now sample pg_locks and 
pg_stat_activity every 5s and run a script that looks for at least one postgres 
process waiting on a semaphore; when it finds one, it captures a stack trace of 
every running postgres process with GDB and also runs strace on 5 of the 
processes to find out which semaphore they're waiting on.  (A rough sketch of 
the watcher script is below.)
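

In case it clarifies the approach, here is a rough Python sketch of the kind of 
watcher we run.  The file names, output paths, and exact gdb/strace invocations 
here are illustrative assumptions, not the actual production script:

#!/usr/bin/env python
# Rough sketch of the semaphore-wait watcher (illustrative only; file names,
# paths, and exact commands are assumptions, not the production script).
import datetime
import subprocess
import time


def postgres_pids():
    # PIDs of every process whose command name is "postgres".
    out = subprocess.check_output(["ps", "-C", "postgres", "-o", "pid="])
    return [int(p) for p in out.decode().split()]


def waiting_on_semaphore(pid):
    # /proc/<pid>/wchan names the kernel function the process is sleeping in.
    try:
        with open("/proc/%d/wchan" % pid) as f:
            return "semtimedop" in f.read()
    except IOError:
        return False


def gdb_backtrace(pid):
    # One-shot backtrace of a single backend; needs gdb installed.
    cmd = ["gdb", "--batch", "-ex", "bt", "-p", str(pid)]
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode()
    except subprocess.CalledProcessError as err:
        return err.output.decode()


def strace_semops(pid, seconds=2):
    # Briefly attach strace to see which semaphore the backend is waiting on;
    # strace prints to stderr, and timeout(1) detaches it after a few seconds.
    cmd = ["timeout", str(seconds), "strace", "-p", str(pid),
           "-e", "trace=semop,semtimedop"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return out.decode()


while True:
    pids = postgres_pids()
    waiters = [p for p in pids if waiting_on_semaphore(p)]
    if waiters:
        stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        with open("/tmp/pg_sem_dump_%s.txt" % stamp, "w") as report:
            for pid in pids:
                report.write("----- %d -----\n%s\n" % (pid, gdb_backtrace(pid)))
            for pid in waiters[:5]:
                report.write("== strace %d ==\n%s\n" % (pid, strace_semops(pid)))
    time.sleep(5)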


The stack traces we catch seem to be representative of where things are waiting 
when we see an event.  Here are two examples; lots of backends appear to be in 
this state:


----- 47245 -----
0x00000037392eb197 in semop () from /lib64/libc.so.6
#0  0x00000037392eb197 in semop () from /lib64/libc.so.6
#1  0x00000000005e0c87 in PGSemaphoreLock ()
#2  0x000000000061e3af in LWLockAcquire ()
#3  0x000000000060aa0f in ReadBuffer_common ()
#4  0x000000000060b2e4 in ReadBufferExtended ()
#5  0x000000000047708d in _bt_relandgetbuf ()
#6  0x000000000047aac4 in _bt_search ()
#7  0x000000000047af8d in _bt_first ()
#8  0x0000000000479704 in btgetbitmap ()
#9  0x00000000006e7e00 in FunctionCall2Coll ()
#10 0x0000000000473120 in index_getbitmap ()
#11 0x00000000005726b8 in MultiExecBitmapIndexScan ()
#12 0x000000000057214d in BitmapHeapNext ()
#13 0x000000000056b18e in ExecScan ()
#14 0x0000000000563ed8 in ExecProcNode ()
#15 0x0000000000562d72 in standard_ExecutorRun ()
#16 0x000000000062ce67 in PortalRunSelect ()
#17 0x000000000062e128 in PortalRun ()
#18 0x000000000062bb66 in PostgresMain ()
#19 0x00000000005ecd01 in ServerLoop ()
#20 0x00000000005ef401 in PostmasterMain ()
#21 0x0000000000590ff8 in main ()

----- 47257 -----
0x00000037392eb197 in semop () from /lib64/libc.so.6
#0  0x00000037392eb197 in semop () from /lib64/libc.so.6
#1  0x00000000005e0c87 in PGSemaphoreLock ()
#2  0x000000000061e3af in LWLockAcquire ()
#3  0x000000000060aa0f in ReadBuffer_common ()
#4  0x000000000060b2e4 in ReadBufferExtended ()
#5  0x000000000047708d in _bt_relandgetbuf ()
#6  0x000000000047aac4 in _bt_search ()
#7  0x000000000047af8d in _bt_first ()
#8  0x00000000004797d1 in btgettuple ()
#9  0x00000000006e7e00 in FunctionCall2Coll ()
#10 0x000000000047339d in index_getnext ()
#11 0x0000000000575ed6 in IndexNext ()
#12 0x000000000056b18e in ExecScan ()
#13 0x0000000000563ee8 in ExecProcNode ()
#14 0x0000000000562d72 in standard_ExecutorRun ()
#15 0x000000000062ce67 in PortalRunSelect ()
#16 0x000000000062e128 in PortalRun ()
#17 0x000000000062bb66 in PostgresMain ()
#18 0x00000000005ecd01 in ServerLoop ()
#19 0x00000000005ef401 in PostmasterMain ()
#20 0x0000000000590ff8 in main ()


Has anyone on the forum seen something similar?   Any suggestions on what to 
look at next?    In case the server hardware is relevant: it has two E5-2670 
CPUs and 256 GB of RAM, and the database is hosted on 1.6 TB of RAID 10 local 
storage (15K 300 GB drives).   The workload is predominantly reads, and the 
queries are mostly fairly simple selects from a single large table, generally 
specifying the primary key in the where clause along with a few other filters.


Thanks,

Matt
