Well, sorry for hijacking... ummm how did I do that?
Anyway I'll thank you for giving a "sign of life" when I was almost loosing
my hopes to get any kind of answer from "-hackers".
I suppose the lack of answers was due to the way I wrote my mail. At that
moment I supposed that at least someone reminded the 2WRS technique and
possible benefits described into previous posts.
I think I was wrong, so I'll write it once again hoping meanwhile to get
some suggestions on: HOWTO write a mail to which "-hackers" will give an
answer :) hehehehe
Thanks for your attention.
Manolo.
--------------------------------------------------
From: "Decibel!" <[EMAIL PROTECTED]>
Sent: Tuesday, January 08, 2008 12:34 AM
To: "Manolo _" <[EMAIL PROTECTED]>
Cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Implementing Sorting Refinements
You'll get better response if you don't hijack threads. :) Also, there's
nothing in here that describes what the benefits of this change are.
On Jan 1, 2008, at 2:09 PM, Manolo _ wrote:
Hi to all.
This mail is aimed at asking some suggestion to face PostgreSQL (PG)
development to implement a refinement to PG source code. I'll briefly
summarize the idea of the "2-Way Replacement Selection" (2WRS)
refinement, just in case. If you already remember what 2WRS is, you can
please jump directly to the IMPLEMENTATION ISSUES part of this mail.
2WRS.
Refinement of the Replacement Selection (RS) technique currently used by
PG in src/backend/utils/sort/tuplesort.c .
The 2WRS uses two heaps instead of just one in order to create the
current (logical) run. Here there are some fundamental points of the
2WRS technique:
- 'qsort' the initial unsorted 'memtuples' array
- divide the 'memtuples' array into two parts and each of those will be
managed as a heap
- one of the heaps will arrange its elements in ascending order, while
the other heap in descending order
- each heap will spill its current root in its corresponding run (i.e.:
we have a run per each of those two heaps), so we are actually creating
2 physical current runs
- those 2 current physical runs could "theoretically" be merged into the
same logical run, actually we can make 'mergesort' think they do belong
to the same physical run. That reduces the number of comparisons
'mergesort' has to do at each merge step (that means less seek delay
time on mass storage). We can also think the average length of logical
runs produced by 2WRS will probably be greater than the average length
produced by RS (again less seek delay time: the longer each run the less
number of final runs produced, that means the less number of comparisons
at each 'mergesort' step).
IMPLEMENTATION ISSUES.
Where to place those heaps?
1) I think that both heaps could be arranged on the same 'memtuples'
array. That allows easily subsequent resizing those heaps according to
their effective use or according to some heuristic, without reallocating
memory.
How to arrange those heaps?
2a) It's convenient to arrange those heaps "root to root". That is
arranging those heaps with their roots toward the center of 'memtuples'
(in a way we can say they lay "face to face", or "root to root" as said
before) while their leaves lay towards the extreme indexes of the
'memtuples' array (that is the last leaf of one heap will lay at index
0, the last leaf of the other heap laying at index memtupsize-1. This
arrangement prevents overlapping elements between those physical runs
associated to the same current logical run.
PRO: once we qsort memtuples and divide it into 2 parts we already get
those two heaps, no need to build them.
CONTRA: ???
2b) As in 2a) but arranging heaps "leaf to leaf", that is their roots
will lay at the extreme indexes of 'memtuples' while their leaves
towards the center of the 'memtuples' array.
Or even start building heaps as soon as we get initial elements, instead
of qsort the whole 'memtuples' array.
Any PRO/CONTRA compared to 2a)???
Current run numbers
I think I should duplicate the 'int currentRun' variable in the
Tuplesortstate struct. I'll replace it with a 'int currentRunUP' and
'int currentRunDOWN' variables in order to distinguish those two
physical runs associated to those 2 heaps. In this case I will give a
run number (max{currentRunUP,currentRunDOWN} + 1) to elements not
belonging to the current logical run. I suppose no need to touch 'long
availMem' nor 'long allowedMem' variables nor any others.
Heap functions
I will duplicate all the heap management functions in order to adapt
them to the kind of heap they should be applied to (for example, the
tuplesort_heap_siftup function should be replaced with
tuplesort_heap_siftupUP and tuplesort_heap_siftupDOWN functions).
Merge Plan
This technique would use a sort of "merge plan" to instruct mergesort on
how to use those physical runs. Actually mergesort should consider at
first "odd runs" before "pair runs". That is, for example, mergesort
should start merging runs with run number 1,3,5,7,... and when run
number X terminates start considering run number X+1. Obviously that
doesn't need any merge plan, but I remember someone else (as Simon
Riggs) was interested in sorting improvements so it could be a good
thing to know if I should consider any conventions or paramethers in
order to possibly create that merge plan.
DEVELOPMENT CONTEXT
I preferred to use the "last stable release" at the moment, that is 8.2.
Any comment/suggestion/advice ?
Thanks for your attention and for your time.
Regards, Manolo.
_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
--
Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster