On Fri, Jul 24, 2020 at 12:16 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Maybe, but we're nowhere close to these limits. See this table which I
> posted earlier:
>
>     2MB    Planned Partitions:  64    HashAgg Batches:  4160
>     4MB    Planned Partitions: 128    HashAgg Batches: 16512
>     8MB    Planned Partitions: 256    HashAgg Batches: 21488
>    64MB    Planned Partitions:  32    HashAgg Batches:  2720
>   256MB    Planned Partitions:   8    HashAgg Batches:     8
>
> This is from the non-parallel runs on the i5 machine with 32GB data set,
> the first column is work_mem. We're nowhere near the 1024 limit, and the
> cardinality estimates are pretty good.
>
> OTOH the number of batches is much higher, so clearly there was some
> recursive spilling happening. What I find strange is that this grows
> with work_mem and only starts dropping after 64MB.

Could that be caused by clustering in the data?

If the input data is in totally random order then we have a good chance
of never having to spill skewed "common" values. That is, we're bound to
encounter common values before entering spill mode, and so those common
values will continue to be usefully aggregated until we're done with the
initial groups (i.e. until the in-memory hash table is cleared in order
to process spilled input tuples). This is great because the common
values get aggregated without ever spilling, and most of the work is
done before we even begin with spilled tuples.

If, on the other hand, the common values are concentrated together in
the input, then many of them will first be encountered only after we've
entered spill mode, and they get written out and reprocessed like any
other group.

Assuming that I have this right, I would also expect simply having more
memory to ameliorate the problem. If you only have/need 4 or 8
partitions then you can fit a higher proportion of the total number of
groups for the whole dataset in the hash table (at the point when you
first enter spill mode). I think it follows that the "nailed" hash table
entries/groupings will "better characterize" the dataset as a whole.
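To make the intuition concrete, here is a toy simulation -- not the real
nodeAgg.c spilling code, just a sketch of the rule "once the hash table
is full, keep aggregating groups that are already in memory and spill
everything else", with made-up keys and a fixed group capacity standing
in for work_mem. It ignores partitioning, recursion, and real memory
accounting; it only models which tuples get spilled at all:

import random

def spilled_tuples(keys, capacity):
    # Fixed-capacity hash table: once it's full we "enter spill mode".
    # Tuples for groups already in memory keep getting aggregated;
    # tuples for any new group are written out for a later pass.
    table = {}
    spilled = 0
    for k in keys:
        if k in table:
            table[k] += 1
        elif len(table) < capacity:
            table[k] = 1
        else:
            spilled += 1
    return spilled

random.seed(0)
# Skewed dataset: 50 very common keys (400 duplicates each) plus a
# long tail of 20,000 keys that each appear exactly once.
common = [f"common{i}" for i in range(50)] * 400
rare = [f"rare{i}" for i in range(20000)]

shuffled = common + rare
random.shuffle(shuffled)        # common values scattered throughout
clustered = rare + common       # common values concentrated at the end

for cap in (1000, 4000, 16000):
    print(cap, "random:", spilled_tuples(shuffled, cap),
          "clustered:", spilled_tuples(clustered, cap))

With the random order essentially all of the common keys are resident in
the table before it fills up, so only part of the long tail spills. With
the clustered order every common tuple spills on the first pass. And
raising the capacity narrows the gap, which matches the "more memory
ameliorates it" expectation.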
> Also, how could the amount of I/O be almost constant in all these cases?
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

Not sure, but might that just be because logtape.c can recycle disk
space?

As I said in my last e-mail, it's pretty reasonable to assume that the
vast majority of external sorts are one-pass. It follows that disk usage
can be thought of as almost the same thing as total I/O for tuplesort.
But the same heuristic isn't reasonable when thinking about hash agg.
Hash agg might write out much less data than the total memory used for
the equivalent "peak optimal nospill" hash agg case -- or much more.
(Again, reiterating what I said in my last e-mail.)

-- 
Peter Geoghegan