> On Mar 4, 2026, at 07:37, Masahiko Sawada <[email protected]> wrote:
>
> On Tue, Mar 3, 2026 at 2:49 PM Chao Li <[email protected]> wrote:
>>
>>
>>
>>> On Mar 4, 2026, at 04:48, Masahiko Sawada <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Feb 11, 2026 at 5:07 PM Chao Li <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On Feb 11, 2026, at 11:19, Tom Lane <[email protected]> wrote:
>>>>>
>>>>> Chao Li <[email protected]> writes:
>>>>>> As $SUBJECT says, my understanding is no.
>>>>>
>>>>> It's not a great idea, but I'm not sure it's fatal. There are places
>>>>> that hold LWLocks for awhile.
>>>>>
>>>>>> I think LWLocks are generally only held for a very short duration,
>>>>>> like a few CPU cycles to read or modify some shared data,
>>>>>
>>>>> Spinlocks are treated that way, but we're willing to hold LWLocks
>>>>> longer. The main thing I'd be concerned about is that there is no
>>>>> deadlock-detection infrastructure for LWLocks, so you'd better be
>>>>> darn certain there is no possibility of deadlock. That usually
>>>>> means you want to limit the extent of code that could run while
>>>>> you're holding the lock.
>>>>>
>>>>> In your specific example, the thing I'd be afraid of is that an
>>>>> errcontext callback might do something you're not expecting.
>>>>> We don't forbid errcontext callbacks from doing catalog lookups,
>>>>> for instance. So on the whole I agree with this patch, with
>>>>> or without any concrete example that fails.
>>>>>
>>>>> regards, tom lane
>>>>
>>>> As Tom has agreed with this patch, added to CF for tracking:
>>>> https://commitfest.postgresql.org/patch/6477/
>>>>
>>>
>>> While I agree with the concern Tom pointed out, I can find that there
>>> are other places where we do ereport() while holding a lwlock. For
>>> instance:
>>>
>>> src/backend/access/transam/multixact.c:
>>> else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
>>> {
>>> ereport(LOG,
>>> (errmsg("cannot truncate up to MultiXact %u because it
>>> does not exist on disk, skipping truncation",
>>> newOldestMulti)));
>>> LWLockRelease(MultiXactTruncationLock);
>>> return;
>>> }
>>>
>>> src/backend/commands/vacuum.c:
>>> if (frozenAlreadyWrapped)
>>> {
>>> ereport(WARNING,
>>> (errmsg("some databases have not been vacuumed in over
>>> 2 billion transactions"),
>>> errdetail("You might have already suffered
>>> transaction-wraparound data loss.")));
>>> LWLockRelease(WrapLimitsVacuumLock);
>>> return;
>>> }
>>>
>>> src/backend/postmaster/autovacuum.c:
>>> LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
>>>
>>> /*
>>> * No other process can put a worker in starting mode, so if
>>> * startingWorker is still INVALID after exchanging our lock,
>>> * we assume it's the same one we saw above (so we don't
>>> * recheck the launch time).
>>> */
>>> if (AutoVacuumShmem->av_startingWorker != NULL)
>>> {
>>> worker = AutoVacuumShmem->av_startingWorker;
>>> worker->wi_dboid = InvalidOid;
>>> worker->wi_tableoid = InvalidOid;
>>> worker->wi_sharedrel = false;
>>> worker->wi_proc = NULL;
>>> worker->wi_launchtime = 0;
>>> dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
>>> &worker->wi_links);
>>> AutoVacuumShmem->av_startingWorker = NULL;
>>> ereport(WARNING,
>>> errmsg("autovacuum worker took too long to
>>> start; canceled"));
>>> }
>>>
>>> Should we fix these places as well?
>>
>> Maybe. If needed, I can take a look at them one by one.
>>
>>>
>>> Also, if we reverse the ereport() and LWLockRelease() in the specific
>>> example in logicalctl.c, it would happen that a concurrent logical
>>> decoding activation writes the log "logical decoding is enabled upon
>>> creating a new logical replication slot" before the deactivation
>>> "logical decoding is disabled because there are no valid logical
>>> replication slots", confusing users since the logical decoding is
>>> active even though the last log saying "logical decoding is disabled".
>>>
>>
>> That sounds reasonable. However, looking at the current code, the “enabling”
>> log is printed after releasing the lock:
>> ```
>> LWLockRelease(LogicalDecodingControlLock);
>>
>> END_CRIT_SECTION();
>>
>> if (!in_recovery)
>> ereport(LOG,
>> errmsg("logical decoding is enabled upon
>> creating a new logical replication slot"));
>> ```
>>
>> So the log order is not currently protected by the lock. If we really want
>> to ensure the ordering between these two messages, we might instead need to
>> move the “enabling” log before releasing the lock.
>
> No, IIUC activation happens while a valid logical replication slot is
> held. Since deactivation requires that no slots are present, there's
> no risk of a concurrent deactivation occurring between the
> LWLockRelease() and the ereport().
>
>> I wonder what you think would be the best way to proceed here.
>
> Thinking more about this: while the fix might be applicable elsewhere,
> it seems unnecessary here. The deactivation process is restricted to
> the startup and checkpointer processes. Given that these processes
> don't access system catalogs for example, I think there is little or
> zero risk of a deadlock occurring even if we do something in
> errcontext.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
Thanks for the explanations. In that case, I will withdraw this patch.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/