On Thu, Aug 06, 2020 at 01:23:31AM +0000, k.jami...@fujitsu.com wrote:
> On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
>
> Hi,
> Thank you for your constructive review and comments.
> Sorry for the late reply.
>
> > Hi,
> >
> > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > Andres Freund <and...@anarazel.de> writes:
> > > > Indeed. The buffer mapping hashtable already is visible as a major
> > > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > > is large enough (so the hashtable is larger than the cache). Not to
> > > > speak of things like a cached sequential scan with a cheap qual and
> > > > wide rows.
> > >
> > > To be fair, the added overhead is in buffer allocation not buffer
> > > lookup. So it shouldn't add cost to fully-cached cases. As Tomas noted
> > > upthread, the potential trouble spot is where the working set is
> > > bigger than shared buffers but still fits in RAM (so there's no actual
> > > I/O needed, but we do still have to shuffle buffers a lot).
> >
> > Oh, right, not sure what I was thinking.
> >
> > > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > > That'll address the case where small tables are frequently dropped -
> > > > and dropping large relations is more expensive from the OS and data
> > > > loading perspective, so it's not gonna happen as often.
> > >
> > > Oooh, interesting idea. We'd need a reliable idea of how long the
> > > relation had been (preferably without adding an lseek call), but maybe
> > > that's do-able.
> >
> > IIRC we already do smgrnblocks nearby, when doing the truncation (to
> > figure out which segments we need to remove). Perhaps we can arrange to
> > combine the two? The layering probably makes that somewhat ugly :(
> >
> > We could also just use pg_class.relpages. It'll probably mostly be
> > accurate enough?
> >
> > Or we could just cache the result of the last smgrnblocks call...
> >
> > One of the cases where this type of strategy is most interesting to me
> > is the partial truncations that autovacuum does... There we even know
> > the range of tables ahead of time.
>
> Konstantin tested it on various workloads and saw no regression.
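[To make the "explicit hashtable probes" idea above concrete, here is a rough
sketch of dropping one fork's buffers by probing the buffer mapping table per
block. This is not code from any posted patch; the function name, signature
and locking details are assumptions, loosely modeled on what
DropRelFileNodeBuffers and InvalidateBuffer in bufmgr.c already do.]

/*
 * Rough sketch only (not the actual patch): invalidate the buffers of one
 * relation fork by probing the buffer mapping hash table per block, instead
 * of scanning all of shared buffers.  A caller would only take this path
 * when the fork is small relative to shared buffers, e.g.
 *
 *     if (nblocks != InvalidBlockNumber && nblocks < NBuffers / 500)
 *         DropRelFileNodeBuffersByProbe(rnode, forkNum, firstDelBlock, nblocks);
 *     else
 *         ... existing full scan of shared buffers ...
 *
 * Assumes it lives in src/backend/storage/buffer/bufmgr.c (InvalidateBuffer
 * is static there); pin handling and error paths are elided.
 */
static void
DropRelFileNodeBuffersByProbe(RelFileNode rnode, ForkNumber forkNum,
                              BlockNumber firstDelBlock, BlockNumber nblocks)
{
    BlockNumber blkno;

    for (blkno = firstDelBlock; blkno < nblocks; blkno++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        /* look the block up in the buffer mapping table */
        INIT_BUFFERTAG(tag, rnode, forkNum, blkno);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;           /* block not in shared buffers */

        /* recheck the tag under the header spinlock, as the full scan does */
        bufHdr = GetBufferDescriptor(buf_id);
        buf_state = LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum >= firstDelBlock)
            InvalidateBuffer(bufHdr);   /* releases the header spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}

The point of the s_b / 500 cutoff is that probing at most NBuffers / 500
blocks this way should be much cheaper than scanning all NBuffers headers,
while DROP/TRUNCATE of large relations keeps using the existing full scan.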
Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding a non-trivial amount of code
to a place that can be quite hot.

And I can trivially reproduce a measurable (and significant) regression
using a very simple pgbench read-only test, with an amount of data that
exceeds shared buffers but fits into RAM.

The following numbers are from an x86_64 machine with 16 cores (32 with
HT), 64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so
16GB, i.e. twice the shared buffers size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

              1 client    8 clients   16 clients
  ----------------------------------------------
   master       38249       236336       368591
   patched      35853       217259       349248
                  -6%          -8%          -5%

This is the average of the runs, but the conclusions for the medians are
almost exactly the same.
> But I understand the sentiment on the added overhead on BufferAlloc.
> Regarding the case where the patch would potentially affect workloads
> that fit into RAM but not into shared buffers, could one of Andres'
> suggested idea/s above address that, in addition to this patch's
> possible shared invalidation fix? Could that settle the added overhead
> in BufferAlloc() as temporary fix?
Not sure.
> Thomas Munro is also working on caching relation sizes [1], maybe that
> way we could get the latest known relation size. Currently, it's
> possible only during recovery in smgrnblocks.
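[As an illustration of the "cache the result of the last smgrnblocks call"
idea quoted earlier, something along these lines might suffice; the
smgr_cached_nblocks field is a made-up name for this sketch, not an existing
field and not anything from Thomas Munro's patch.]

/* hypothetical addition to SMgrRelationData in include/storage/smgr.h */
BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1];   /* last known size per
                                                     * fork, InvalidBlockNumber
                                                     * if never fetched */

/* in smgrnblocks(), src/backend/storage/smgr/smgr.c */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    /* remember it so callers can get a cheap (possibly stale) size hint */
    reln->smgr_cached_nblocks[forknum] = result;

    return result;
}

DropRelFileNodeBuffers() could then consult the cached value to decide
whether per-block probing is worthwhile, falling back to the full scan when
the cache is unset or the relation is too large. Whether such a cached value
can be trusted outside recovery is of course the open question here.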
It's not clear to me how knowing the relation size would help reduce the
overhead of this patch.

Can't we somehow identify cases when this optimization might help and
only actually enable it in those cases? Like in recovery, with a lot of
truncates, or something like that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: run.sh (Bourne shell script)
master (tps per run):

1 client:
  38408.098942 37148.085739 38553.640548 37243.143856 38487.044062
  37225.251112 38584.828645 37474.801133 38506.580090 37942.415423
  38438.285457 38249.098321 37232.708808 37368.309774 38268.592606

8 clients:
  242598.113719 240728.545096 238609.937059 236060.993323 235654.171372
  238064.785008 244072.880501 233852.274273 229449.357520 238786.728666
  229605.148726 233260.304044 232185.071739 242031.025232 236335.954147

16 clients:
  365702.997309 366334.247594 371966.060740 376383.183023 375279.446966
  375416.168457 368591.075982 367195.919635 355262.316767 368375.432125
  361657.667234 386455.075762 365156.675183 372176.374557 376528.470288
patched (tps per run):

1 client:
  35725.116701 34158.764487 36346.391570 35951.323419 33164.079463
  35317.282661 35991.253869 36031.895254 36446.263439 33969.603184
  32546.144221 36465.661628 36246.570646 35852.782217 34677.603098

8 clients:
  214320.502337 223890.751691 213921.995625 214569.182623 215144.759957
  223018.095568 218089.639999 217755.461221 217259.258268 225533.656349
  226289.201510 212267.029873 214954.731109 218113.356619 210749.229416

16 clients:
  342668.984341 356999.569014 350751.292688 353380.258701 347648.832986
  346108.809856 340116.169279 355782.130289 350859.226050 348832.982648
  354467.086425 349248.179668 359556.618191 347499.041862 346793.236680