> The bottom line turns out to be that on the Alpha hardware, it is > possible for TAS() to fail even when the lock is initially zero, > because that hardware's locking protocol will fail to acquire the > lock if the ldq_l/stq_c sequence is interrupted. TAS() *must* be > called in a retry loop on Alphas. Thus, the coding presently in > xlog.c, > > while (TAS(&(XLogCtl->chkp_lck))) > { > struct timeval delay = {2, 0}; > > if (shutdown) > elog(STOP, "Checkpoint lock is busy while data > base is shutting down"); > (void) select(0, NULL, NULL, NULL, &delay); > } > > is no good because it does not allow for multiple retries. > > Offhand I see no good reason why the above-quoted code isn't just > > S_LOCK(&(XLogCtl->chkp_lck)); > > and propose to fix this problem by reducing it to that. If the lock > is held when it shouldn't be, we'll fail with a stuck-spinlock error. It was not test against stuck slock: postmaster can shutdown db only when there is no backends running and this was just additional test for that. Assuming that postmaster does right things I don't object to get rid of this test, but I would just remove shutdown test itself - checkpoints run long time and 2 sec timeout is better than ones in s_lock_sleep(); > It also bothers me that xlog.c contains several places where > there is a potentially infinite wait for a lock. It seems to me > that these should time out with stuck-spinlock messages. Do you object > to such a change? I always considered time-based decision is slock stuck or not as hack - in old times elog(ERROR/FATAL) didn't unlock slocks sometimes, isn't it fixed now? Anyway I don't object if it bothers you -:) But please do not use S_LOCK - as you see WAL code try to do other things if slock of "primary interest" is busy. Vadim