On Wed, 13 Jan 2021 at 19:19, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sat, Dec 19, 2020 at 01:00:01PM +0800, Craig Ringer wrote:
> >
> > The attached patch set follows on from the discussion in [1] "Add LWLock
> > blocker(s) information" by adding the actual LWLock* and the numeric
> > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
> >
> > This does not provide complete information on blockers, because it's not
> > necessarily valid to compare any two LWLock* pointers between two process
> > address spaces. The locks could be in DSM segments, and those DSM
> > segments could be mapped at different addresses.
> >
> > I wasn't able to work out a sensible way to map a LWLock* to any sort of
> > (tranche-id, lock-index) because there's no requirement that locks in a
> > tranche be contiguous or known individually to the lmgr.
> >
> > Despite that, the patches improve the information available for LWLock
> > analysis significantly.
>
> Thanks for the patches, this could indeed be useful. I've looked through
> them and haven't noticed any issues with either the tracepoint extensions
> or the comments, except that I find it not entirely clear how tranche_id
> indicates a re-initialization here?
>
>     /* Re-initialization of individual LWLocks is not permitted */
>     Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster);
>

There should be no reason for anything to call LWLockInitialize(...) on an
individual LWLock, since individual LWLocks are all initialized during
postmaster startup and all have tranche IDs below NUM_INDIVIDUAL_LWLOCKS.

So if that assertion trips for one of those tranche IDs under the postmaster,
something is re-initializing an individual lock, and that must be a bug.

But that's a trivial change that can be done separately.
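
For context, here's roughly how that check reads in LWLockInitialize() - a
sketch abbreviated from lwlock.c rather than the exact patch hunk:

    void
    LWLockInitialize(LWLock *lock, int tranche_id)
    {
        /*
         * Individual (named) LWLocks all have tranche IDs below
         * NUM_INDIVIDUAL_LWLOCKS and are set up once at postmaster startup,
         * so hitting one of those IDs here under the postmaster means
         * something is re-initializing an individual lock.
         */
        Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster);

        pg_atomic_init_u32(&lock->state, LW_FLAG_RELEASE_OK);
        lock->tranche = tranche_id;
        proclist_init(&lock->waiters);
    }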


> > Patch 2 adds the tranche id and lock pointer for each trace hit. This
> > makes it possible to differentiate between individual locks within a
> > tranche, and (so long as they aren't in a DSM segment) to compare locks
> > between processes. That means you can do lock-order analysis etc., which
> > was not previously especially feasible.
>
> I'm curious in which kind of situations lock-order analysis could be
> helpful?
>

If code-path 1 does

    LWLockAcquire(LockA, LW_EXCLUSIVE);
    ...
    LWLockAcquire(LockB, LW_EXCLUSIVE);

and code-path 2 does:

    LWLockAcquire(LockB, LW_EXCLUSIVE);
    ...
    LWLockAcquire(LockA, LW_EXCLUSIVE);

then they're subject to deadlock. But you might not actually hit it often in
test workloads if the timing window required for the deadlock is tight
and/or the colliding code paths run only on infrequent operations.

It's not always easy to reason about or prove things about lock ordering
when the acquisitions are potentially nested deep within many layers of
other calls and callbacks. That's obviously something we try to avoid with
LWLocks, but it's not impossible.

If you trace a workload and collect all the lock-acquisition nestings it
actually performs, you can then check whether any of those orderings
conflict and where a deadlock could arise.

A PoC to do so is on my TODO; a rough sketch of the simplest case follows
below.
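
That PoC doesn't exist yet, so this is only a minimal sketch of the simplest
(two-lock) case, with made-up names, assuming the trace output has already
been reduced to (outer, inner) pairs meaning "inner was acquired while outer
was held", keyed on the LWLock* values from the tracepoints. A real tool
would need a proper cycle check over the whole acquisition graph, and as
noted above the pointers are only comparable across processes for locks that
aren't in DSM segments.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* one observed nesting: "inner" acquired while "outer" was held */
    typedef struct
    {
        uintptr_t   outer;
        uintptr_t   inner;
    } LockPair;

    static bool
    pair_seen(const LockPair *pairs, size_t n, uintptr_t outer, uintptr_t inner)
    {
        for (size_t i = 0; i < n; i++)
            if (pairs[i].outer == outer && pairs[i].inner == inner)
                return true;
        return false;
    }

    /* flag any A-before-B ordering that was also observed as B-before-A */
    static void
    report_inversions(const LockPair *pairs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (pair_seen(pairs, n, pairs[i].inner, pairs[i].outer))
                printf("possible deadlock pair: %p and %p\n",
                       (void *) pairs[i].outer, (void *) pairs[i].inner);
    }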

> > Traces also don't have to do userspace reads for the tranche name all
> > the time, so the trace can run with lower overhead.
>
> This one is also interesting. Just to clarify, wouldn't there be a bit of
> overhead anyway (due to switching between kernel context and user space
> when a tracepoint is hit) that would mask the name-read overhead? Or are
> there any numbers available about it?
>

I don't have numbers on that. Whether it matters will depend way too much
on how you're using the probe points and collecting/consuming the data
anyway.

It's a bit unfortunate (IMO) that we make a function call for each
tracepoint invocation to get the tranche name. Ideally I'd prefer to be
able to omit the tranche name lookups for these probes entirely for
something as hot as LWLocks. But it's a bit of a pain to look up tranche
names from an external trace tool, so instead I'm inclined to see if we
can enable systemtap's semaphores and only compute the tranche name if the
target probe is actually enabled. But that'd be separate from this patch
and would require a build change in how systemtap support is compiled and
linked.
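
To illustrate what I mean - assuming the dtrace-generated probes.h exposes
the usual <PROVIDER>_<PROBE>_ENABLED() semaphore macros, which is exactly
the build change mentioned above - the wait-start probe in lwlock.c could
be guarded like this (shown with the existing two-argument probe form):

    /*
     * Sketch only: skip the tranche name lookup (the T_NAME() call) unless
     * some tool actually has this probe enabled.
     */
    if (TRACE_POSTGRESQL_LWLOCK_WAIT_START_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);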

BTW, a user->kernel->user context switch only occurs when the trace tool's
probes use kernel space - such as for perf-based probes or for systemtap's
kernel-runtime probes. The same markers can be used by e.g. systemtap's
"dyninst" runtime, which runs entirely in userspace.
