Simon’s patch had the effect of reducing the number of passes by increasing the number of tapes depending on the memory available, but that’s a long tail effect as seen in figure (70?) in Knuth.
Where I’d like this to go is the implementation of a two pass “create runs, merge”, where the second merge can be avoided unless random access is needed (as discussed previously on list).
In the run creation phase, the idea would be to implement something like quicksort or another L2-cache friendly algorithm (ideas?)
On 2/19/06 8:19 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
"Luke Lonergan" <[EMAIL PROTECTED]> writes:
> So you know, we=B9ve done some more work on the external sort to remove the
> =B3tape=B2 abstraction from the code, which makes a significant improvement.
Improvement where? That code's down in the noise so far as I can tell.
I see results like this (with the patched code):
CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
147310 31.9110 tuplesort_heap_siftup
68381 14.8130 comparetup_index
34063 7.3789 btint4cmp
22573 4.8899 AllocSetAlloc
19317 4.1845 writetup_index
18953 4.1057 tuplesort_gettuple_common
18100 3.9209 mergepreread
17083 3.7006 GetMemoryChunkSpace
12527 2.7137 LWLockAcquire
11686 2.5315 LWLockRelease
6172 1.3370 tuplesort_heap_insert
5392 1.1680 index_form_tuple
5323 1.1531 PageAddItem
4943 1.0708 LogicalTapeWrite
4525 0.9802 LogicalTapeRead
4487 0.9720 LockBuffer
4217 0.9135 heapgettup
3891 0.8429 IndexBuildHeapScan
3862 0.8366 ltsReleaseBlock
It appears that a lot of the cycles blamed on tuplesort_heap_siftup are
due to cache misses associated with referencing memtuples entries
that have fallen out of L2 cache. Not sure how to improve that though.
regards, tom lane