Shouldn't batch splitting be followed by a hash function change?

On Mon, Apr 22, 2019, 05:09 Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
> >On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> >> Jeff Janes <jeff.ja...@gmail.com> writes:
> >> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
> >> > extreme silliness.
> >>
> >> > If a single 32 bit hash value has enough tuples by itself to not fit in
> >> > work_mem, then it will keep splitting until that value is in a batch by
> >> > itself before shutting off
> >>
> >> I suspect, however, that we might be better off just taking the existence
> >> of the I/O buffers into account somehow while deciding whether it's worth
> >> growing further. That is, I'm imagining adding a second independent
> >> reason for shutting off growEnabled, along the lines of "increasing
> >> nbatch any further will require an unreasonable amount of buffer memory".
> >> The question then becomes how to define "unreasonable".
> >
> >On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> >> I think the question the code needs to be asking is "If we double the
> >> number of batches, does the amount of memory we need drop?" And the
> >> memory needs to account both for the buffers and per-batch data.
> >>
> >> I don't think we can just stop increasing the number of batches when the
> >> memory for BufFile exceeds work_mem, because that entirely ignores the
> >> fact that by doing that we force the system to keep the per-batch stuff
> >> in memory (and that can be almost arbitrary amount).
> >...
> >> Of course, this just stops enforcing work_mem at some point, but it at
> >> least attempts to minimize the amount of memory used.
> >
> >This patch defines reasonable as "additional BatchFiles will not themselves
> >exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
> >going to save us RAM"...
> >
>
> OK.
>
> >I think the first condition is insensitive and not too important to get right,
> >it only allows work_mem to be exceeded by 2x, which maybe already happens for
> >multiple reasons, related to this thread and otherwise. It'd be fine to slap
> >on a factor of /2 or /4 or /8 there too.
> >
>
> TBH I'm not quite sure I understand all the conditions in the patch - it
> seems unnecessarily complicated. And I don't think it actually minimizes
> the amount of memory used for hash table + buffers, because it keeps the
> same spaceAllowed (which pushes nbatches up). At some point it actually
> makes sense to bump spaceAllowed and make larger batches instead of adding
> more batches, and the patch does not seem to do that.
>
> Also, the patch does this:
>
>    if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
>    {
>        ExecHashIncreaseNumBatches(hashtable);
>    }
>    else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
>    {
>        /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing
>           nbatches to increase */
>        /* I think this branch would be hit almost same as below branch */
>        ExecHashIncreaseNumBatches(hashtable);
>    }
>    ...
>
> but the reasoning for the second branch seems wrong, because
>
>    (spaceUsed/2 >= spaceAllowed)
>
> is not enough to guarantee that we actually save memory by doubling the
> number of batches. To do that, we need to make sure that
>
>    (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))
>
> But that may not be true - it certainly is not guaranteed by not getting
> into the first branch.
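To make the criterion above concrete, here is a minimal standalone sketch (not
the actual executor code; the function name is made up and BLCKSZ merely stands
in for sizeof(PGAlignedBlock)) of when doubling the batch count actually
reduces total memory, i.e. the check the patch's second branch is missing:

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192     /* stand-in for sizeof(PGAlignedBlock), one BufFile buffer */

    /*
     * Doubling nbatch frees roughly half of the in-memory hash table
     * (spaceUsed / 2), but it also adds roughly nbatch more BufFile
     * buffers.  It only pays off when the freed amount covers the
     * added buffer memory.
     */
    static bool
    doubling_saves_memory(size_t spaceUsed, int nbatch)
    {
        size_t freed = spaceUsed / 2;
        size_t added = (size_t) nbatch * BLCKSZ;

        return freed >= added;
    }

With spaceUsed pinned near 2 MB, this stops being true once nbatch exceeds a
few hundred, which is exactly where the log below shows the BufFile memory
taking over.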
>
> Consider an ideal example with uniform distribution:
>
>    create table small (id bigint, val text);
>    create table large (id bigint, val text);
>
>    insert into large select 1000000000 * random(), md5(i::text)
>    from generate_series(1, 700000000) s(i);
>
>    insert into small select 1000000000 * random(), md5(i::text)
>    from generate_series(1, 10000) s(i);
>
>    vacuum analyze large;
>    vacuum analyze small;
>
>    update pg_class set (relpages, reltuples) = (1000000, 1)
>    where relname = 'large';
>
>    update pg_class set (relpages, reltuples) = (1, 1000000)
>    where relname = 'small';
>
>    set work_mem = '1MB';
>
>    explain analyze select * from small join large using (id);
>
> A log after each call to ExecHashIncreaseNumBatches says this:
>
>    nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
>    nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
>    nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
>    nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
>    nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
>    nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
>    nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
>    nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
>    nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
>    nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
>    nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
>    nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
>    nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
>    nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
>    nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728
>
> So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
> seems rather confusing, BTW), but we've allocated 128MB for BufFile. So
> about 130MB in total. With 16k batches.
>
> What I think might work better is the attached v2 of the patch, with a
> single top-level condition, comparing the combined memory usage
> (spaceUsed + BufFile) against spaceAllowed. But it also tweaks
> spaceAllowed once the size needed for BufFile gets over work_mem/3.
>
> And it behaves like this:
>
>    nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
>    nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
>    nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
>    nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
>    nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
>    nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
>    nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
>    nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
>    nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
>    nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
>    nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216
>
> So we end up with just 2k batches, using ~24MB of memory in total.
> That's because the spaceAllowed limit was bumped up instead of adding
> more and more batches.
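The attached v2 patch is not reproduced here, but the behaviour described above
can be sketched roughly as follows. This is only an illustrative approximation
under stated assumptions: the struct, helper names, and the factor-of-2 growth
are hypothetical, BLCKSZ again stands in for sizeof(PGAlignedBlock), and
doubling nbatch stands in for ExecHashIncreaseNumBatches():

    #include <stddef.h>

    #define BLCKSZ 8192                  /* stand-in for sizeof(PGAlignedBlock) */

    typedef struct HashtableSketch
    {
        int     nbatch;                  /* current number of batches */
        size_t  spaceUsed;               /* in-memory hash table size */
        size_t  spaceAllowed;            /* work_mem-derived limit */
    } HashtableSketch;

    /* one 8kB BufFile buffer per batch, matching the log lines above */
    static size_t
    buffile_bytes(const HashtableSketch *ht)
    {
        return (size_t) ht->nbatch * BLCKSZ;
    }

    static void
    consider_growth(HashtableSketch *ht, size_t work_mem_bytes)
    {
        /* single top-level test: hash table plus BufFile buffers vs. limit */
        if (ht->spaceUsed + buffile_bytes(ht) > ht->spaceAllowed)
        {
            ht->nbatch *= 2;             /* stands in for ExecHashIncreaseNumBatches() */

            /* once the BufFile buffers alone need more than work_mem/3,
             * also relax spaceAllowed, so later checks prefer larger
             * batches over adding yet more batches */
            if (buffile_bytes(ht) > work_mem_bytes / 3)
                ht->spaceAllowed *= 2;   /* assumed growth factor */
        }
    }

The design point is that both knobs move together: once the per-batch buffers
dominate, raising spaceAllowed keeps the total (hash table + buffers) growth
roughly linear instead of letting the buffer side explode, which matches the
~24MB / 2k-batch outcome in the second log.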
>
> >The current patch doesn't unset growEnabled, since there's no point at which
> >the hashtable should grow without bound: if hash tables are *already*
> >exceeding work_mem by 2x, nbatches should be doubled.
>
> Not sure. I guess it might be useful to re-evaluate the flag after a
> while - not necessarily by actually enabling it right away, but just
> checking if it'd move any tuples. Just disabling it once may be an issue
> when the input data is somehow correlated, which seems to be one of the
> issues with the data set discussed in this thread.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>