Tom Lane wrote:
> I've been thinking about Ludwig Lim's recent report of a "stuck
> spinlock" failure on a heavily loaded machine. Although I originally
> found this hard to believe, there is a scenario which makes it
> plausible. Suppose that we have a bunch of recently-started backends
> as well as one or more that have been running a long time --- long
> enough that the scheduler has niced them down a priority level or two.
> Now suppose that one of the old-timers gets interrupted while holding
> a spinlock (an event of small but nonzero probability), and that before
> it can get scheduled again, several of the newer, higher-priority
> backends all start trying to acquire the same spinlock. The "acquire"
> code looks like "try to grab the spinlock a few times, then sleep for
> 10 msec, then try again; give up after 1 minute". If there are enough
> backends trying this that cycling through all of them takes at least
> 10 msec, then the lower-priority backend will never get scheduled, and
> after a minute we get the dreaded "stuck spinlock".
>
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second. The number of iterations would be
> reduced so that we still time out after a minute's total delay.
>
> Comments?
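For concreteness, here is a rough sketch of the sleep schedule described in the quoted proposal. It is only an illustration under stated assumptions: the constant and function names are hypothetical and are not taken from the PostgreSQL source, the cap is fixed at exactly one second, and the actual sleeping and lock-testing are left out so that only the arithmetic is shown.

```c
/* Hypothetical constants, taken from the numbers in the proposal. */
#define MIN_DELAY_MSEC   10      /* sleep after the first failure */
#define MAX_DELAY_MSEC   1000    /* assumed cap: "maybe a second" */
#define TIMEOUT_MSEC     60000   /* give up after about a minute in total */

/* Next sleep interval: double the current one, clamped to the cap. */
int
next_delay_msec(int cur_delay_msec)
{
    int next = cur_delay_msec * 2;

    return (next > MAX_DELAY_MSEC) ? MAX_DELAY_MSEC : next;
}

/*
 * Count how many sleeps fit before the cumulative delay would exceed
 * TIMEOUT_MSEC.  This is the reduced iteration count the proposal
 * refers to.
 */
int
sleeps_before_timeout(void)
{
    int delay = MIN_DELAY_MSEC;
    int total = 0;
    int n = 0;

    while (total + delay <= TIMEOUT_MSEC)
    {
        total += delay;
        delay = next_delay_msec(delay);
        n++;
    }
    return n;
}
```

With these assumed constants the schedule runs 10, 20, 40, ... 640, then 1000 msec repeatedly, and sleeps_before_timeout() comes out to 65, compared with roughly 6000 sleeps for a fixed 10 msec interval over the same minute.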
Should there be any correlation between the manner by which the backoff
occurs and the number of active backends?

Mike Mascari
[EMAIL PROTECTED]