On Thu, Aug 06, 2020 at 01:23:31AM +0000, k.jami...@fujitsu.com wrote:
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

Hi,

On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > Indeed. The buffer mapping hashtable already is visible as a major
> > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > is large enough (so the hashtable is larger than the cache). Not to
> > speak of things like a cached sequential scan with a cheap qual and wide rows.
>
> To be fair, the added overhead is in buffer allocation not buffer lookup.
> So it shouldn't add cost to fully-cached cases.  As Tomas noted
> upthread, the potential trouble spot is where the working set is
> bigger than shared buffers but still fits in RAM (so there's no actual
> I/O needed, but we do still have to shuffle buffers a lot).

Oh, right, not sure what I was thinking.


> > Wonder if the temporary fix is just to do explicit hashtable probes
> > for all pages iff the size of the relation is < s_b / 500 or so.
> > That'll address the case where small tables are frequently dropped -
> > and dropping large relations is more expensive from the OS and data
> > loading perspective, so it's not gonna happen as often.
>
> Oooh, interesting idea.  We'd need a reliable idea of how large the
> relation is (preferably without adding an lseek call), but maybe
> that's do-able.

IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
which segments we need to remove). Perhaps we can arrange to combine the
two? The layering probably makes that somewhat ugly :(

We could also just use pg_class.relpages. It'll probably mostly be accurate
enough?

Or we could just cache the result of the last smgrnblocks call...


One of the cases where this type of strategy is most interesting to me is
the partial truncations that autovacuum does... There we even know the
range of tables ahead of time.

> Konstantin tested it on various workloads and saw no regression.

Unfortunately Konstantin did not share any details about what workloads
he tested, which config, etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding a non-trivial amount of
code to a place that can be quite hot.

And I can trivially reproduce a measurable (and significant) regression
using a very simple pgbench read-only test, with an amount of data that
exceeds shared buffers but fits into RAM.

The following numbers are from a x86_64 machine with 16 cores (32 w HT),
64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so 16GB,
i.e. twice the SB size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

               1 client    8 clients    16 clients
    ----------------------------------------------
    master        38249       236336        368591
    patched       35853       217259        349248
                    -6%          -8%           -5%

This is the average of the runs, but the conclusions for medians are
almost exactly the same.

> But I understand the sentiment on the added overhead on BufferAlloc.
> Regarding the case where the patch would potentially affect workloads
> that fit into RAM but not into shared buffers, could one of Andres'
> suggested ideas above address that, in addition to this patch's
> possible shared invalidation fix? Could that settle the added overhead
> in BufferAlloc() as a temporary fix?

Not sure.

> Thomas Munro is also working on caching relation sizes [1], maybe that
> way we could get the latest known relation size. Currently, it's
> possible only during recovery in smgrnblocks.

It's not clear to me how knowing the relation size would help reduce the
overhead of this patch?

Can't we somehow identify cases when this optimization might help and
only actually enable it in those cases? Like in recovery, with a lot of
truncates, or something like that.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: run.sh
Description: Bourne shell script

master
1 client
38408.098942
37148.085739
38553.640548
37243.143856
38487.044062
37225.251112
38584.828645
37474.801133
38506.580090
37942.415423
38438.285457
38249.098321
37232.708808
37368.309774
38268.592606
8 clients
242598.113719
240728.545096
238609.937059
236060.993323
235654.171372
238064.785008
244072.880501
233852.274273
229449.357520
238786.728666
229605.148726
233260.304044
232185.071739
242031.025232
236335.954147
16 clients
365702.997309
366334.247594
371966.060740
376383.183023
375279.446966
375416.168457
368591.075982
367195.919635
355262.316767
368375.432125
361657.667234
386455.075762
365156.675183
372176.374557
376528.470288
patched
1 client
35725.116701
34158.764487
36346.391570
35951.323419
33164.079463
35317.282661
35991.253869
36031.895254
36446.263439
33969.603184
32546.144221
36465.661628
36246.570646
35852.782217
34677.603098
8 clients
214320.502337
223890.751691
213921.995625
214569.182623
215144.759957
223018.095568
218089.639999
217755.461221
217259.258268
225533.656349
226289.201510
212267.029873
214954.731109
218113.356619
210749.229416
16 clients
342668.984341
356999.569014
350751.292688
353380.258701
347648.832986
346108.809856
340116.169279
355782.130289
350859.226050
348832.982648
354467.086425
349248.179668
359556.618191
347499.041862
346793.236680
