Thanks for the reply, Tom. I admit we were a bit rushed during the troubleshooting process; now that we know precisely how to identify these procs and deal with them, I imagine we'll grab more info next time before killing them, including a netstat view and a stack trace per your recommendation.
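
For our own runbook, this is roughly what I plan to capture next time before cancelling anything. A sketch only: the pid and port below are placeholders, and the pg_stat_activity column names are the ones our server uses (procpid/current_query); newer Postgres releases rename these to pid/query.

    # What each busy backend is actually running
    psql -c "SELECT procpid, usename, current_query
             FROM pg_stat_activity
             WHERE current_query <> '<IDLE>';"

    # The kernel's view of the sockets those backends hold
    # (assumes the default port 5432; run as root to see owning pids)
    netstat -tnp | grep 5432

    # Stack trace of one busy backend (12345 is a placeholder pid)
    gdb -p 12345 -batch -ex bt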
Cheers,
-Chris.

On Mon, May 16, 2011 at 1:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Christopher Opena <counterv...@gmail.com> writes:
> > Hello all,
> > First time poster here - probably a good sign since I've been running
> > postgresql with zero issues for the last several years! At any rate,
> > I've recently run into a strange issue. Client-based EOFs are nothing
> > new to our application; people can sometimes close a connection for a
> > number of reasons. However, on the DB side these have always been
> > released with no issue.
>
> > Today, we had 4 connections that saw either a client-based EOF or a
> > 'canceling statement due to user request', and in all of these 4 cases
> > the query remained active and started chewing heavily into the cpu such
> > that they ground the db server to a halt until each of the procpids was
> > manually issued a 'pg_cancel_backend(procpid);' from the console. None
> > of the 4 queries were particularly heavy; they registered approximately
> > a 2300ms completion time using a query EXPLAIN ANALYZE. A little long,
> > but not too far out of the ordinary for our usual database usage.
>
> It's always been the case that if the client drops the connection
> mid-query, the backend will notice that only when it next tries to send
> some data --- and even then, if the kernel isn't aware that the far end
> has dropped the connection, the kernel may accept and buffer quite a bit
> of data before blocking the backend from doing more work. So if the
> query requires a lot of processing before it starts to emit data, the
> backend can do a lot of work before noticing anything is wrong.
>
> I suspect that these queries were expensive and you made a mistake in
> what you tested after the fact (or maybe conditions changed, eg
> autovacuum came along and updated statistics). It's also possible that
> some network glitch affected the connections such that the server's
> kernel didn't know they were lost. It's hard to say much more than that
> without a lot more evidence than you've provided. If it happens again,
> you might try poking a bit harder into the situation before you kill the
> queries --- netstat's opinion of the connection statuses might be
> interesting, and so might stack traces from the busy backends.
>
>                        regards, tom lane
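
P.S. For the archives: the manual cancel we issued was just the built-in function from psql; if a cancel ever fails to take, my understanding is that pg_terminate_backend (available in 8.4 and later) is the heavier option. The pids below are placeholders, and procpid again matches our server's column name.

    # Ask the runaway backends to cancel their current queries (what we did)
    psql -c "SELECT pg_cancel_backend(procpid)
             FROM pg_stat_activity
             WHERE procpid IN (12345, 12346);"

    # Heavier option if a cancel doesn't take: kill the backend process
    psql -c "SELECT pg_terminate_backend(12345);"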