On Thu, Apr 26, 2018 at 10:54 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> I think pgbench isn't a very interesting workload though.  I don't
>> have time right now but it would be fun to dig further and try to
>> construct workloads that generate pathological searches for buffers (I
>> believe that such workloads exist, from anecdotal reports).
>
> While pgbench might not be the most interesting workload, I think even
> its super-synthetic, uniformly distributed point lookups can benefit
> from better caching decisions. The LRU-K paper's example 1.1
> explains very cogently why this might be: point lookups in TPC-A
> (or TPC-C) consist of descending the index and then accessing the
> record in the main table from a TID, which implies quite a
> non-uniform access pattern for individual pages, despite accessing
> table records uniformly.
>
> There are a negligible number of internal pages involved with pgbench
> (and they should always be cached), so they can almost be ignored.
> It's really just a matter of what proportion of shared_buffers should
> be leaf pages versus heap pages. The pgbench accounts table is
> approximately 6x the size of its primary key, so the leaf blocks have
> 6x the utility of heap blocks for this workload. It actually is pretty
> much that simple. Or it would be, if we didn't have the OS cache as a
> confounding factor. The optimal strategy for the pgbench workload as
> far as maximizing hits in shared_buffers goes is to *only* cache leaf
> blocks, at least until you have all leaf blocks cached; we should
> evict table blocks *immediately*. This isn't even remotely close to
> the behavior of an LRU, of course. Actually, as the LRU-K paper
> also points out, an LRU perversely tends to somewhat *favor* the
> table blocks, since each point lookup's table block is presumably
> accessed after its leaf block is pinned/accessed.

Huh.  Right.  So it's not truly uniform.  I wonder if any of these
algorithms are sensitive to the fact that the leaf pages are more
valuable than the heap pages.  It seems quite subtle: even though
we can see that individual leaf pages must be ~6x more likely than
individual heap pages to be accessed again next time, due to their
higher tuple density, they're still very unlikely to be accessed
again soon.  The question is whether any of these algorithms can
track that for long enough to see the difference between two very
low access frequencies, in a huge field of unlikely-to-be-accessed
pages.  LRU, by not even attempting to model frequency, is I guess
uniquely affected by the heap-after-index-leaf thing.  (I haven't
read the LRU-K paper... I have a lot to learn before I could
contribute much here.  Added to pile.  Thanks.)
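
To put rough numbers on that ~6x figure, here's a back-of-envelope
calculation.  The per-page sizes are my own rough guesses, not
measurements:

    /*
     * Back-of-envelope only: assumes 8kB pages, a ~120 byte
     * pgbench_accounts heap tuple, and a ~20 byte btree leaf entry
     * (4 byte int key + index tuple header + line pointer) at
     * fillfactor 90.  All of these sizes are rough guesses, not
     * measured values.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double  heap_tuples_per_page = 8192.0 / 120.0;        /* ~68 */
        double  leaf_entries_per_page = 8192.0 * 0.9 / 20.0;  /* ~369 */

        /*
         * Under uniform row access, a page's chance of being touched
         * by the next lookup is proportional to the rows it covers.
         */
        printf("leaf/heap utility ratio: ~%.1f\n",
               leaf_entries_per_page / heap_tuples_per_page);
        return 0;
    }

That prints ~5.4, in the same ballpark as the ~6x table-to-index
size ratio you mention, which is the same observation taken from
the other direction.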

A thought experiment about the U-shaped performance when your dataset
fits in neither PG nor kernel cache, but would fit entirely in
physical memory if you made either of the two caches big enough:  I
suppose when you read a page in, you could tell the kernel that you
POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
advance somehow), on the theory that the coldest stuff in the PG cache
should now become the hottest stuff in the OS cache.  Of course that
sucks, because the best the kernel can do then is go and read it from
disk, and the goal is to avoid IO.  Given a hypothetical way to
"write" "clean" data to the kernel (so it wouldn't mark it dirty
and generate IO, but would let you read it back without generating
IO if you're lucky), perhaps you could actually achieve exclusive
caching at the two levels, and use all your physical RAM without
duplication.  Then perhaps the U shape would go away, and the curve
would instead go up all the way until shared buffers is big enough
to hold the whole dataset?
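
Here's a minimal sketch of the half of that idea that is possible
with the posix_fadvise() advice values that exist today.  The hook
names and block-number interface are made up for illustration; this
is not real PostgreSQL code:

    #include <fcntl.h>

    #define BLCKSZ 8192

    /*
     * After a block has been read into shared_buffers, hint that
     * the kernel's copy is no longer needed, aiming for exclusive
     * rather than duplicated caching across the two levels.
     */
    static void
    after_read_hint(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_DONTNEED);
    }

    /*
     * Before a clean buffer is stolen, hint that its old contents
     * may soon be wanted in the kernel cache.  As noted above, the
     * best the kernel can do with this today is re-read the block
     * from disk; there is no way to hand it clean data without
     * generating IO, which is exactly the missing piece.
     */
    static void
    before_evict_hint(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }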

> I'm glad that you're thinking about this, in any case. Seems like a
> promising area to work on.

It's a promising area that I'm interested in learning more about, but
I'm not promising to work on it :-)

-- 
Thomas Munro
http://www.enterprisedb.com
