On 09/05/2016 06:19 PM, Ivan Kartyshov wrote:
On 09/03/2016 05:04 AM, Tomas Vondra wrote:
This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.
Thank you for a valuable hint.

So, will we get a rebased patch? I see the patch is back in 'needs review' but there's no new version.

> If we will replace consistent method, then we should replace it with
> partially consistent method (called "nonconsistent") because:
> 1) it's based on fast spinlocks (it's not fully up to its name, though)

In other words, it does exactly the thing proposed up-thread, i.e.
locking only buffer headers.

What do you mean by fast spinlocks? And why aren't they up to the name?

Not the spinlocks themselves; the name "nonconsistent" is somewhat
misleading. At the moment we can't implement concurrent shared-memory
access without locks in general, so the least consistent method that has
been proposed is the "nonconsistent" one. Strictly speaking, though,
*nonconsistent* does not fully live up to its name, because it still
contains a locking mechanism (spinlocks).

I'm a bit confused by the naming, but my understanding is that nonconsistent only acquires (spin)locks on individual buffers, so it does not have a consistent view on shared_buffers as a whole, but data about individual buffers are consistent. I think that's fine and perfectly acceptable for this extension.

> 2) it's *almost* the fastest one (the less time needed for execution of
> method, the less data core will change and as a consequence the more
> consistent snapshot will be)

I'm not sure I understand what you're saying here? What do you mean by
"almost the fastest one"?

I mean the fastest of the methods that have been proposed ("consistent",
"semiconsistent", "nonconsistent").

I'm a bit confused by the results you reported before, i.e. that

1) nonconsistent is 10% faster than consistent method
2) semiconsistent is 5-times slower than nonconsistent method

What does that mean? Are you refering to duration of the queries reading
data from pg_buffercache, or to impact on the other queries?

Here I mean "working duration time".

I have no idea what "working duration time" is.

How can be semiconsistent 5x slower than nonconsistent? Wouldn't that
make it significantly slower than the consistent method?

Firstly, when we want to measure the quality of pg_buffercache, we must
measure several values:
1) execution time (the duration of the queries reading data from
pg_buffercache);
2) how it influences the system (and other queries) while it runs.

Secondly, the semiconsistent method is slower than the nonconsistent and
consistent methods, but it has less influence on other queries than the
consistent one.

Thirdly, it is slower because it locks the partitions of shared memory
in a different way than the consistent or nonconsistent methods do.
The semiconsistent strategy holds only one partition lock at a time.
It therefore has no significant effect on the system, but it takes
more time.

It'd be really useful if you could provide actual numbers, explain what metrics you compare and how. I'm sure it makes sense to you but it's utterly confusing for everyone else, and it also makes it impossible to reproduce the benchmark.

I've looked at the code (applied on top of e3b607c) and I see this in pg_buffercache_pages:

    for (j = 0; j < num_partit; j++)
    {
        if (snaptype == CONS_SNAP)
            for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
                LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
        else if (snaptype == SEMICONS_SNAP)
            LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);

        for (i = 0; i < NBuffers; i++)
            ... walk through all shared buffers ...
    }

How is the SEMICONS_SNAP case supposed to only scan the currently locked partition, when the code simply scans all shared buffers for SEMICONS_SNAP? That seems outright broken; it should scan only the currently locked partition.

Secondly, I see this bit added to the loop over buffers:

    if (bufHdr->tag.forkNum == -1)
        fctx->record[i].blocknum = InvalidBlockNumber;

and I have no idea why this is needed (when it was not needed before).

The patch also breaks a few comments, because it adds code between the comment and the related code. For example the (likely unnecessary) bit of code changes this:

    /* Lock each buffer header before inspecting. */
    buf_state = LockBufHdr(bufHdr);

into this

    /* Lock each buffer header before inspecting. */
    if (bufHdr->tag.forkNum == -1)
        fctx->record[i].blocknum = InvalidBlockNumber;
    buf_state = LockBufHdr(bufHdr);

which is confusing (and I'd argue broken).

Similarly when releasing the locks, the original comment/code block looks like this:

   /*
    * And release locks.  We do this in reverse order for two reasons:
    * (1) Anyone else who needs more than one of the locks will be trying
    * to lock them in increasing order; we don't want to release the
    * other process until it can get all the locks it needs. (2) This
    * avoids O(N^2) behavior inside LWLockRelease.
    */
   for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)

but the patch changes it to this:

  if (snaptype == CONS_SNAP)
    for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
      ...
  else if (snaptype == SEMICONS_SNAP)
    /*
     * And release locks.  We do this in reverse order for two reasons:
     * (1) Anyone else who needs more than one of the locks will be
     * trying to lock them in increasing order; we don't want to
     * release the other process until it can get all the locks it
     * needs. (2) This avoids O(N^2) behavior inside LWLockRelease.
     */

Notice how the comment is moved from the loop to the block that only unlocks a single partition. That clearly makes no sense, as the comment is about the CONS_SNAP loop.

Now that I read the comment, I wonder whether the semiconsistent case (locking one partition at a time) changes this reverse-order unlocking behavior or not - clearly, the guarantee that the other process will get all the locks at once is no longer true, as pg_buffercache will likely hold the next one.


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
