On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra wrote:
> That may be easily due to differences between the CPUs and configuration.
> For example the Xeon uses a way older CPU with different amounts of CPU
> cache, and it's also a multi-socket system. And so on.
So, having searched past threads, I guess this was your Xeon E5450,
which has a 12MB L2 cache. I also see that you have an Intel Core
i5-2500K, which has 6MB of shared last-level cache. This is mid-range
hardware, and the two CPUs were discontinued in 2010 and 2013
respectively. Now, the i5 has the smaller cache, so if anything I'd
expect it to do worse than the Xeon, not better. But leaving that
aside, I think there is an issue that we don't want to lose sight of,
which is: in most of the regressions we were discussing today, perhaps
the entire heap structure can fit in cache. This would be true for
stuff like int4 CREATE INDEX builds, where a significant fraction of
memory is used for IndexTuples that most or all comparisons never
actually have to read from main memory. And that's on a CPU that was
discontinued by its manufacturer just over 5 years ago. I think this
is why "padding" cases can make the patch look not much better, and
occasionally worse, at the low end: they keep the number of memtuples
as a fraction of work_mem very low, and so mask the problems with
replacement selection.
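To make that concrete, here is a back-of-envelope sketch. The ~24-byte
per-tuple array overhead and the tuple widths are illustrative
assumptions, not measured PostgreSQL figures; the point is only how
padding shrinks the heap's hot working set:

```python
# Rough estimate of the memory that heap comparisons actually touch.
# The 24-byte per-tuple array entry and the tuple widths below are
# illustrative assumptions, not measured PostgreSQL internals.

def heap_bytes(work_mem, tuple_bytes, sorttuple_bytes=24):
    """Bytes in the in-memory tuple array, which is what heap sifting
    walks. The number of in-memory tuples is bounded by work_mem."""
    n_memtuples = work_mem // (sorttuple_bytes + tuple_bytes)
    return n_memtuples * sorttuple_bytes

work_mem = 8 * 1024 * 1024  # 8MB

small_tuples = heap_bytes(work_mem, tuple_bytes=16)     # e.g. int4 index build
padded_tuples = heap_bytes(work_mem, tuple_bytes=1024)  # wide "padding" rows

print(small_tuples // 1024, "KB")   # ~4915 KB: barely fits a 6MB cache
print(padded_tuples // 1024, "KB")  # ~187 KB: trivially cache-resident
```

With wide padded tuples the heap's hot data is a couple of hundred KB
and sits entirely in cache, hiding replacement selection's poor
locality; with narrow tuples the array is several MB and cache misses
start to dominate.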

When Greg Stark benchmarked the patch at the low end, to identify
regressions, he did find some slight regressions at the lowest
work_mem settings with many, many passes, but they were quite small.
Greg also did some good analysis of the performance characteristics of
external sorting today that I recommend reading if you missed it. It's
possible that those regressions have since been fixed, because Greg
did not apply/test the memory batching patch that became commit
0011c0091e886b as part of this. It seems likely that it's at least
partially fixed, and the patch might even be better than master now.
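For intuition about why the lowest work_mem settings imply many merge
passes, here's a rough sketch with hypothetical numbers. It assumes
each initial run is about work_mem in size (quicksorted runs; this
undersells replacement selection's longer runs):

```python
import math

def merge_passes(input_bytes, work_mem, merge_order):
    """Rough pass count for an external merge sort: initial runs of
    ~work_mem each, merged merge_order at a time until one run is left."""
    runs = math.ceil(input_bytes / work_mem)
    if runs <= 1:
        return 0  # the whole sort fits in memory
    return math.ceil(math.log(runs, merge_order))

gib = 1 << 30
print(merge_passes(gib, 256 * 1024 * 1024, 7))  # ample memory: 1 pass
print(merge_passes(gib, 4 * 1024 * 1024, 7))    # tiny work_mem: 3 passes
```

Shrinking work_mem both multiplies the run count and forces extra
passes over the full input, which is where the low-end regressions
would be expected to show up.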

Anyway, what I liked about Greg's approach to finding regressions at
the low end was that when testing, he used the cheapest possible VM
available on Google's cloud platform. When testing the low end, he had
low-end hardware to go with the low-end work_mem settings. This gave
the patch the benefit of using quicksort to make good use of what I
assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB.
I think Greg might actually have used a home server to test my patch,
but I understand that it too was suitably low-end.

It's perfectly valid to bring economics into this; typically, an
external sort occurs only because memory isn't infinitely affordable,
or because it isn't worth provisioning enough memory to be totally
confident that you can do every sort internally. With external
sorting, the constant factors are what researchers generally spend
most of their time worrying about. In TAOCP Volume III, Knuth spends a
lot of time discussing how the characteristics of actual magnetic tape
drives changed throughout the 1970s.

It's quite valid to ask whether anyone would actually want an 8MB
work_mem setting on a machine that has 12MB of L2 cache, cache that an
external sort gets all to itself. Is that actually a practical setup
that anyone would want to use?

Sent via pgsql-hackers mailing list (email@example.com)