On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan <pe...@2ndquadrant.com> wrote:
> On 16 January 2012 00:59, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> I think it would be better to pre-deduct the tape overhead amount we
>> will need if we decide to switch to tape sort from the availMem before
>> we even start reading (and then add it back if we do indeed make that
>> switch).  That way we wouldn't over-run the memory in the first place.
>> However, that would cause apparent regressions in which sorts that
>> previously fit into maintenance_work_mem no longer do.  Boosting
>> maintenance_work_mem to a level that was actually being used
>> previously would fix those regressions, but pointing out that the
>> previous behavior was not optimal doesn't change the fact that people
>> are used to it and perhaps tuned to it.  So the attached patch seems
>> more backwards-friendly.
>
> Hmm. Are people really setting maintenance_work_mem such that it is
> exactly large enough to quicksort when building an index in one case
> but not another?
It could also apply to work_mem in other situations.  I'm just using
maintenance_work_mem in my example because index creation is the thing
that led me into this little diversion, and so that is what my test-bed
is set up to use.

Some people do have very stable, "ritualized" workloads and so could
have tuned exactly to them.  I don't know how common that would be.
More common might be synthetic benchmarks that had been tuned, either
intentionally or accidentally, to just barely fit in the (overshot)
memory; it would then be a perception problem that there was a
regression, when in fact the tuning had merely become out of date.

> Is the difference large enough to warrant avoiding pre-deduction?

Switching to an external sort when you could have done it all by
quicksort (by cheating just a little bit on memory usage) makes a
pretty big difference, over 2-fold in my tests.  If the fast-path
optimization for qsort goes in, the difference might get even bigger.
Of course, there will always be some point at which that switch-over
must occur, so the real question is: do we need to keep that switch
point historically consistent to avoid nasty surprises for people with
excessively fine-tuned *work_mem settings based on old behavior?  And
do such people even exist outside of my imagination?

But a bigger question I now have is whether any of this matters.
Since there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision.  So, does it matter if we slightly lie
about how much we will actually use?  And is it worth worrying about
whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using an originally defined size of memory for the
priority heap, and then later reducing that amount of memory.
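To make the pre-deduction idea concrete, here is a minimal sketch of
the bookkeeping, with hypothetical names and numbers; the real
tuplesort.c accounting (availMem, tape buffer sizing) is of course
more involved than this:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    long        avail_mem;  /* bytes the in-memory phase may still use */
    long        reserve;    /* held back for tape buffers if we spill */
} SortBudget;

/* Pre-deduct the tape overhead before reading the first tuple, so the
 * in-memory phase can never overrun work_mem in the first place. */
static void
budget_init(SortBudget *b, long work_mem_bytes, long tape_overhead)
{
    b->reserve = tape_overhead;
    b->avail_mem = work_mem_bytes - tape_overhead;
}

/* Does the next tuple still fit without switching to tape sort? */
static bool
fits_in_memory(const SortBudget *b, long tuple_bytes)
{
    return b->avail_mem >= tuple_bytes;
}

/* On switching to tape sort, the reserve is "added back" and immediately
 * spent on tape buffers, so total usage still stays within work_mem. */
static long
budget_spill(SortBudget *b)
{
    long        tape_buffers = b->reserve;

    b->reserve = 0;
    return tape_buffers;
}
```

The visible cost is exactly the one described above: a sort that
previously just fit into work_mem now sees a budget smaller by the
reserve, and spills unless the setting is raised to match what was
really being used before.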
That is annoying because it complicates the theoretical model, and it
was even more annoying when I realized that the "freed" memory that
results from doing this is too fragmented to even be used for other
purposes.  But theoretical annoyances aside, it is hardly the worst
thing about memory usage in sorting.  It is just one that is standing
in the way of my further study, so I don't think the patch I proposed
is particularly important in its own right.  I want to see if anyone
will point out that I am all wet because my analysis failed to take
<foo> into account.  I probably should have explained this when I
submitted the patch.

The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB.  It is probably
too late to fix this problem for 9.2.  I wish I had started there
rather than here.  This 16,777,216-tuple limitation will get even more
unfortunate if the specializations that speed up qsort but not
external sort get accepted.

Attached is a completely uncommittable proof-of-concept /
work-in-progress patch to get around the limitation.  It shows a
2-fold improvement when indexing an integer column on a 50,000,000-row
randomly ordered table.

As for what to do to make it committable, it seems like maybe there
should be two different MaxAllocSizes.  One would be determined by the
fact that aset.c assumes it can compute twice an allocation's size
without overflow; I would think that simply testing size_t at compile
time would be enough.  Allowing more than 1GB allocations would
probably be pointless on 32-bit machines, while 64-bit machines should
be able to compute twice a much larger value than 1GB without
overflow.

For the varlena stuff, I don't know what to do.  Is the "I swear to
God I won't pass this pointer to one of those functions" approach
sufficient?  Or does each function need to test it?
And I'm pretty sure that palloc, not just repalloc, would need an
over-size alternative to make this a general-purpose improvement.

Cheers,

Jeff
--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers