I'm having a rare but deadly problem.  On our web servers, a Postgres backend process 
occasionally gets stuck and can't be unstuck.  Once it's stuck, all Postgres activity 
ceases.  "kill -9" is required to kill it -- signals 2 and 15 don't work, and 
"/etc/init.d/postgresql stop" fails.

Here's what the process table looks like:

$ ps -ef | grep postgres
postgres 30713     1  0 Apr24 ?        00:02:43 /usr/local/pgsql/bin/postmaster -p 5432 -D /disk3/postgres/data
postgres 25423 30713  0 May08 ?        00:03:34 postgres: writer process
postgres 25424 30713  0 May08 ?        00:00:02 postgres: stats buffer process
postgres 25425 25424  0 May08 ?        00:00:02 postgres: stats collector process
postgres 11918 30713 21 07:37 ?        02:00:27 postgres: production webuser 127.0.0.1(21772) SELECT
postgres 31624 30713  0 16:11 ?        00:00:00 postgres: production webuser [local] idle
postgres 31771 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12422) idle
postgres 31772 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12421) idle
postgres 31773 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12424) idle
postgres 31774 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12425) idle
postgres 31775 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12426) idle
postgres 31776 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12427) idle
postgres 31777 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12428) idle

The SELECT process is the one that's stuck.  top(1) and other indicators show that 
nothing is going on at all (no CPU usage, normal memory usage); the process seems to be 
blocked waiting for something.  (The "idle" processes are attached to a FastCGI 
program.)
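
For anyone who wants to look at the same thing on their own box, here is a minimal 
sketch of how a process's sleep state can be read from /proc.  It assumes a Linux 
/proc filesystem; show_wait_state is a hypothetical helper name used only for 
illustration, and the wchan file may be missing, empty, or unreadable depending on 
kernel configuration and privileges.

```shell
#!/bin/sh
# Hypothetical helper: report the scheduler state of one PID and, where the
# kernel exposes it, the kernel function the process is sleeping in.
# With no argument it inspects the current shell.
show_wait_state() {
    pid="${1:-$$}"
    # State: R = running, S = interruptible sleep, D = uninterruptible sleep
    grep '^State:' "/proc/$pid/status"
    # wchan is optional; print it only if the file can be read
    if [ -r "/proc/$pid/wchan" ]; then
        printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    fi
}
```

Running "show_wait_state 11918" would report on the stuck SELECT backend in the 
listing above.  A state of D usually means the kernel is waiting on I/O, while S is 
an ordinary interruptible wait.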

This has happened on *two different machines*, both doing completely different 
tasks.  The first one is essentially a read-only warehouse that serves lots of 
queries, and the second one is the server we use to load the warehouse.  In 
both cases, Postgres has been running for a long time, and is issuing SELECT 
statements that it's issued millions of times before with no problems.  No 
other processes are accessing Postgres, just the web services.

This is a deadly bug, because our web site goes dead when this happens, and it 
requires an administrator to log in, kill the stuck postgres process, and then 
restart Postgres.  We've installed a failover system so that the web site is 
diverted to a backup server, but since this has happened twice in one week, 
we're worried.

Any ideas?

Details:

   Postgres 8.0.3
   Linux 2.6.12-1.1381_FC3smp i686 i386

   Dell 2-CPU Xeon system (hyperthreading is enabled)
   4 GB memory
   2 x 120 GB disks (SATA on machine 1, IDE on machine 2)

Thanks,
Craig
