Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

Michael Paquier Mon, 02 Oct 2017 17:55:18 -0700

On Tue, Oct 3, 2017 at 6:54 AM, Tom Lane <[email protected]> wrote:
> I wrote:
>> If this is the only problem then I'd agree we should stick to a spinlock
>> (I assume the strings in question can't be very long).  I was thinking
>> more about what to do if we find other violations that are harder to fix.


I don't think that there is any need to switch to a LWLock. None of the
issues that need to be dealt with here require one, provided we are fine
with the memcpy method, of course.
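
For readers following along, the memcpy method can be sketched roughly as
follows. This is an illustration under assumptions, not the code from
walreceiver.c: DemoWalRcv, DemoSnapshot, the demo_spin_* helpers and the
buffer sizes are made-up stand-ins, with a C11 atomic_flag playing the role
of slock_t. The idea is that only fixed-size copies happen while the lock is
held, and anything that may allocate works on the private copy afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for slock_t, built on a C11 atomic_flag. */
typedef struct { atomic_flag locked; } demo_spinlock;

static void demo_spin_init(demo_spinlock *l) { atomic_flag_clear(&l->locked); }

static void demo_spin_acquire(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* busy-wait until the flag is clear */
}

static void demo_spin_release(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

#define DEMO_NAMELEN 64         /* assumed sizes, for illustration only */
#define DEMO_CONNLEN 256

/* Hypothetical subset of the shared WAL receiver fields. */
typedef struct
{
    int  pid;
    char slotname[DEMO_NAMELEN];
    char conninfo[DEMO_CONNLEN];
} DemoSnapshot;

typedef struct { demo_spinlock mutex; DemoSnapshot data; } DemoWalRcv;

/* Only a fixed-size memcpy happens while the lock is held. */
static void demo_take_snapshot(DemoWalRcv *walrcv, DemoSnapshot *out)
{
    demo_spin_acquire(&walrcv->mutex);
    memcpy(out, &walrcv->data, sizeof(*out));
    demo_spin_release(&walrcv->mutex);
}

/* May allocate, so it runs on the private copy. Caller frees the result. */
static char *demo_describe(const DemoSnapshot *snap)
{
    size_t len = sizeof(snap->slotname) + sizeof(snap->conninfo) + 64;
    char  *buf = malloc(len);

    if (buf != NULL)
        snprintf(buf, len, "pid=%d slot=%s conninfo=%s",
                 snap->pid, snap->slotname, snap->conninfo);
    return buf;
}
```

With this shape an allocation failure in demo_describe() cannot leave the
lock held, since the lock was released before the allocation was attempted.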

> I took a quick look through walreceiver.c, and there are a couple of
> obvious problems of the same ilk in WalReceiverMain, which is doing this:
>
>         walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime =
>             walrcv->latestWalEndTime = GetCurrentTimestamp();
>
> (ie, a potential kernel call) inside a spinlock.  But there seems no
> real problem with just collecting the timestamp before we enter that
> critical section.

No problem with that from here either.
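
A rough sketch of collecting the timestamp before entering the critical
section, under assumptions: DemoWalRcv is a cut-down, hypothetical model of
the shared struct, demo_current_timestamp() stands in for
GetCurrentTimestamp(), and the demo_spin_* helpers are an atomic_flag
stand-in for the real spinlock primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for slock_t, built on a C11 atomic_flag. */
typedef struct { atomic_flag locked; } demo_spinlock;

static void demo_spin_init(demo_spinlock *l) { atomic_flag_clear(&l->locked); }

static void demo_spin_acquire(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* busy-wait until the flag is clear */
}

static void demo_spin_release(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical, cut-down model of the shared WAL receiver state. */
typedef struct
{
    demo_spinlock mutex;
    int64_t lastMsgSendTime;
    int64_t lastMsgReceiptTime;
    int64_t latestWalEndTime;
} DemoWalRcv;

/* Stand-in for GetCurrentTimestamp(): microseconds, may enter the kernel. */
static int64_t demo_current_timestamp(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void demo_init_timestamps(DemoWalRcv *walrcv)
{
    /* The potential kernel call happens before the lock is taken. */
    int64_t now = demo_current_timestamp();

    demo_spin_acquire(&walrcv->mutex);
    walrcv->lastMsgSendTime = now;
    walrcv->lastMsgReceiptTime = now;
    walrcv->latestWalEndTime = now;
    demo_spin_release(&walrcv->mutex);
}
```

The critical section is then reduced to three plain stores.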

> I also don't especially like the fact that just above there it reaches
> elog(PANIC) with the lock still held, though at least that's a case that
> should never happen.

This part has been around since the beginning in 1bb2558. I agree that
the lock should be released before doing the logging.
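
One way to picture releasing the lock before logging is sketched below,
assuming a made-up three-state model (DemoState), a hypothetical
demo_claim_startup() helper, and fprintf() standing in for elog(). It is not
the actual state machine in WalReceiverMain; it only shows the state being
read under the lock and the report being made after the release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for slock_t, built on a C11 atomic_flag. */
typedef struct { atomic_flag locked; } demo_spinlock;

static void demo_spin_init(demo_spinlock *l) { atomic_flag_clear(&l->locked); }

static void demo_spin_acquire(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* busy-wait until the flag is clear */
}

static void demo_spin_release(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical, simplified state model; not the real WalRcvState enum. */
typedef enum { DEMO_STOPPED, DEMO_STARTING, DEMO_STREAMING } DemoState;

typedef struct { demo_spinlock mutex; DemoState state; } DemoShared;

/*
 * Returns true if startup may proceed, false if the shared state was
 * unexpected.  The lock is released on every path before anything is
 * reported.
 */
static bool demo_claim_startup(DemoShared *shared)
{
    DemoState seen;

    demo_spin_acquire(&shared->mutex);
    seen = shared->state;
    if (seen == DEMO_STARTING)
        shared->state = DEMO_STREAMING;
    demo_spin_release(&shared->mutex);

    if (seen != DEMO_STARTING)
    {
        /* Stand-in for elog(); runs only after the lock is released. */
        fprintf(stderr, "unexpected shared state %d\n", (int) seen);
        return false;
    }
    return true;
}
```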

> Further down, it's doing a pfree() inside the spinlock, apparently
> for no other reason than to save one "if (tmp_conninfo)".

Check.
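
As a sketch of moving the free out of the critical section, assuming a
hypothetical DemoWalRcv with a fixed-size conninfo buffer of made-up size
DEMO_MAXCONNINFO, malloc()/free() in place of palloc()/pfree(), and an
atomic_flag stand-in for the spinlock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for slock_t, built on a C11 atomic_flag. */
typedef struct { atomic_flag locked; } demo_spinlock;

static void demo_spin_init(demo_spinlock *l) { atomic_flag_clear(&l->locked); }

static void demo_spin_acquire(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* busy-wait until the flag is clear */
}

static void demo_spin_release(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

#define DEMO_MAXCONNINFO 1024   /* assumed size, for illustration only */

typedef struct
{
    demo_spinlock mutex;
    char conninfo[DEMO_MAXCONNINFO];
} DemoWalRcv;

/* Publishes tmp_conninfo (may be NULL) into shared state, then frees it. */
static void demo_publish_conninfo(DemoWalRcv *walrcv, char *tmp_conninfo)
{
    size_t len = 0;

    /* The length is computed before the lock is taken. */
    if (tmp_conninfo != NULL)
    {
        len = strlen(tmp_conninfo);
        if (len >= DEMO_MAXCONNINFO)
            len = DEMO_MAXCONNINFO - 1;
    }

    demo_spin_acquire(&walrcv->mutex);
    memset(walrcv->conninfo, 0, DEMO_MAXCONNINFO);
    if (len > 0)
        memcpy(walrcv->conninfo, tmp_conninfo, len);
    demo_spin_release(&walrcv->mutex);

    /* The extra "if" costs nothing; the free happens outside the lock. */
    if (tmp_conninfo != NULL)
        free(tmp_conninfo);
}
```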

> I don't especially like the Asserts inside spinlocks, either.  Personally,
> I think if those conditions are worth testing then they're worth testing
> for real (in production).  Variables that are manipulated by multiple
> processes are way more likely to assume unexpected states than local
> variables.

Those could be replaced by similar checks done outside the spinlock
sections, using local variables marked PG_USED_FOR_ASSERTS_ONLY, though
I am not sure those are worth worrying about. What do others think?

> I'm also rather befuddled by the fact that this code sets and clears
> walrcv->latch outside the critical sections.  If that field is used
> by any other process, surely that's completely unsafe.  If it isn't,
> why is it being kept in shared memory?

Do you mean the introduction of WalRcvForceReply by 314cbfc? This is
more recent, and was discussed during the review of the remote_apply
patch as a way to avoid sending SIGUSR1 too often from the startup
process to the WAL receiver:
https://www.postgresql.org/message-id/CA+TgmobPsROS-gFk=_KJdW5scQjcKtpiLhsH9Cw=uwh1htf...@mail.gmail.com

Attached is a patch that addresses the bugs in the spinlock sections.
-- 
Michael

walreceiver-spin-calls.patch
Description: Binary data
