On 09/05/2016 06:19 PM, Ivan Kartyshov wrote:
On 09/03/2016 05:04 AM, Tomas Vondra wrote:
This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.
Thank you for a valuable hint.

So, will we get a rebased patch? I see the patch is back in 'needs review' but there's no new version.

> If we will replace consistent method, then we should replace it with
> partially consistent method (called "nonconsistent") because:
> 1) it's based on fast spinlocks (it's not fully up to its name, though)

In other words, it does exactly the thing proposed up-thread, i.e.
locking only buffer headers.

What do you mean by fast spinlocks? And why aren't they up to the name?

Not the spinlocks themselves; the name "nonconsistent" is somewhat
misleading. At the moment we can't implement concurrent shared-memory
access without locks in general, so the least consistent method that has
been proposed is the "nonconsistent" one. Strictly speaking, though,
*nonconsistent* does not fully live up to its name, because it still
contains a locking mechanism (spinlocks).

I'm a bit confused by the naming, but my understanding is that nonconsistent only acquires (spin)locks on individual buffers, so it does not have a consistent view on shared_buffers as a whole, but data about individual buffers are consistent. I think that's fine and perfectly acceptable for this extension.

> 2) it's *almost* the fastest one (the less time needed for execution of
> method, the less data core will change and as a consequence the more
> consistent snapshot will be)

I'm not sure I understand what you're saying here? What do you mean by
"almost the fastest one"?

I mean the fastest of the methods that have been proposed ("consistent",
"semiconsistent", "nonconsistent").

I'm a bit confused by the results you reported before, i.e. that

1) nonconsistent is 10% faster than consistent method
2) semiconsistent is 5-times slower than nonconsistent method

What does that mean? Are you refering to duration of the queries reading
data from pg_buffercache, or to impact on the other queries?

Here I mean "working duration time".

I have no idea what "working duration time" is.

How can be semiconsistent 5x slower than nonconsistent? Wouldn't that
make it significantly slower than the consistent method?

Firstly, when we want to measure the quality of pg_buffercache, we must
measure several values:
1) execution time (the duration of the queries reading data from
pg_buffercache);
2) how it influences the system (and other queries) while it runs.

Secondly, the semiconsistent method is slower than the nonconsistent and
consistent methods, but it has less influence on other queries than the
consistent one.

Thirdly, it is slower because it locks the partitions of shared memory
in a different way than the consistent or nonconsistent methods do.
The semiconsistent strategy holds only one partition lock at a time.
It therefore has no significant effect on the system, but it takes
more time.

It'd be really useful if you could provide actual numbers, explain what metrics you compare and how. I'm sure it makes sense to you but it's utterly confusing for everyone else, and it also makes it impossible to reproduce the benchmark.

I've looked at the code (applied on top of e3b607c) and I see this in pg_buffercache_pages:

    for (j = 0; j < num_partit; j++)
    {
        if (snaptype == CONS_SNAP)
            for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
                LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
        else if (snaptype == SEMICONS_SNAP)
            LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);

        for (i = 0; i < NBuffers; i++)
            ... walk through all shared buffers ...
    }

How is the SEMICONS_SNAP case supposed to only scan the currently locked partition, when the code simply scans all shared buffers for SEMICONS_SNAP? That seems outright broken; it should scan only the currently locked partition.

Secondly, I see this bit added to the loop over buffers:

    if (bufHdr->tag.forkNum == -1)
        fctx->record[i].blocknum = InvalidBlockNumber;

and I have no idea why this is needed (when it was not needed before).

The patch also breaks a few comments, because it adds code between the comment and the related code. For example the (likely unnecessary) bit of code changes this:

    /* Lock each buffer header before inspecting. */
    buf_state = LockBufHdr(bufHdr);

into this

    /* Lock each buffer header before inspecting. */
    if (bufHdr->tag.forkNum == -1)
        fctx->record[i].blocknum = InvalidBlockNumber;
    buf_state = LockBufHdr(bufHdr);

which is confusing (and I'd argue broken).

Similarly when releasing the locks, the original comment/code block looks like this:

   /*
    * And release locks.  We do this in reverse order for two reasons:
    * (1) Anyone else who needs more than one of the locks will be trying
    * to lock them in increasing order; we don't want to release the
    * other process until it can get all the locks it needs. (2) This
    * avoids O(N^2) behavior inside LWLockRelease.
    */
   for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)

but the patch changes it to this:

  if (snaptype == CONS_SNAP)
    for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
      ...
  else if (snaptype == SEMICONS_SNAP)
    /*
     * And release locks.  We do this in reverse order for two reasons:
     * (1) Anyone else who needs more than one of the locks will be
     * trying to lock them in increasing order; we don't want to
     * release the other process until it can get all the locks it
     * needs. (2) This avoids O(N^2) behavior inside LWLockRelease.
     */

Notice how the comment is moved from the loop to the block that only unlocks a single partition. That clearly makes no sense, as the comment is about the CONS_SNAP loop.

Now that I read the comment, I wonder whether the semiconsistent case (locking one partition at a time) changes this reverse-order unlocking behavior or not - clearly, the guarantee that the other process will get all the locks at once is no longer true, as pg_buffercache will likely hold the next one.


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
