>From: Pailloncy Jean-Gerard <[EMAIL PROTECTED]>
>Sent: Sep 29, 2005 7:11 AM
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>>>Your main example seems to focus on a large table where a key
>>>column has constrained values. This case is interesting in
>>>proportion to the number of possible values. If I have billions
>>>of rows, each having one of only two values, I can think of a
>>>trivial and very fast method of returning the table "sorted" by
>>>that key: make two sequential passes, returning the first value
>>>on the first pass and the second value on the second pass.
>>> This will be faster than the method you propose.
>>1= No, that was not my main example. It was the simplest example,
>>used to frame the later, more complicated examples. Please don't
>>get hung up on it.
>>2= You are incorrect. Since IO is the most expensive operation we
>>can do, any method that makes two passes through the data at top
>>scanning speed will take at least 2x as long as any method that only
>>takes one such pass.
>You do not get the point.
>At the time you get the sorted references to the tuples, you still
>need to fetch the tuples themselves, check their visibility, etc.,
>and return them to the client.
As PFC correctly points out elsewhere in this thread, =maybe= you
have to do all that. The vast majority of the time, people are not
going to want to look at a detailed record-by-record listing of the
sorted table. The most common usage is to calculate or summarize
some quality or quantity of the data and display that instead, or to
use the tuples found (or some quality of those tuples) as an
intermediate step in a longer query process such as a join.
Sometimes there's a need to see _some_ of the detailed records: a
random sample, a region from some part of the table, etc. It's rare
that there is a real-world need to actually list every record in a
table of significant size.
On the rare occasions where one does have to return or display all
records in such a large table, network IO and/or display IO speeds
are the primary performance bottleneck, not HD IO.
Nonetheless, if there _is_ such a need, there's nothing stopping us
from rearranging the records in RAM into sorted order in one pass
through RAM (using at most space for one extra record) after
constructing the cache-conscious B-tree index. Then the sorted
records can be written to HD in RAM-buffer-sized chunks very
efficiently. Repeating this process until we have stepped through
the entire data set will take no more HD IO than one HD scan of the
data, and it leaves us with a permanent result that can be reused
for multiple purposes. If the sorted records are written in large
enough chunks, rereading them at any later time can be done at
maximum HD throughput.
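The one-pass, one-spare-record rearrangement can be done by following
permutation cycles. A minimal sketch, with hypothetical names (this is
not PostgreSQL code; the destination array would come from the index
built above):

```python
def rearrange_in_place(records, dest):
    """Permute records so records[i] ends up at position dest[i], in one
    pass, holding at most one extra record at a time (plus O(n) flag
    bits of bookkeeping, which are far smaller than records)."""
    n = len(records)
    placed = [False] * n
    for start in range(n):
        if placed[start] or dest[start] == start:
            placed[start] = True
            continue
        carried = records[start]        # the single spare record slot
        i = dest[start]
        placed[start] = True
        while i != start:               # walk this permutation cycle
            carried, records[i] = records[i], carried
            placed[i] = True
            i = dest[i]
        records[start] = carried        # close the cycle

# usage: dest[i] = sorted rank of records[i]
table = [5, 1, 4, 2, 3]
order = sorted(range(len(table)), key=lambda i: table[i])
dest = [0] * len(table)
for rank, i in enumerate(order):
    dest[i] = rank
rearrange_in_place(table, dest)
assert table == [1, 2, 3, 4, 5]
```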
In a total of two HD scans (one to read the original data, one
to write out the sorted data) we can make a permanent
rearrangement of the data. We've essentially created a
cluster index version of the data.
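Producing that clustered copy might look like the following sketch:
records are streamed out in index order and the write buffer is
flushed in large chunks so each write is one big sequential IO (names
here are illustrative stand-ins, not PostgreSQL internals):

```python
def write_sorted(read_record, sorted_positions, write_chunk, buffer_size):
    """Stream records out in sorted order, flushing the write buffer
    in large chunks so each write is one large sequential IO."""
    buffer = []
    for pos in sorted_positions:
        buffer.append(read_record(pos))
        if len(buffer) >= buffer_size:
            write_chunk(buffer)
            buffer = []
    if buffer:
        write_chunk(buffer)             # flush the final partial chunk

# usage with in-memory stand-ins for the table and the output file
table = [("row%d" % i, i % 5) for i in range(10)]
order = sorted(range(len(table)), key=lambda i: table[i][1])
out = []
write_sorted(lambda i: table[i], order, out.extend, buffer_size=4)
assert [key for _, key in out] == sorted(key for _, key in table)
```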
>So, if there are only 2 values in the column of a big table that is
>larger than available RAM, two seq scans of the table without any
>sorting is the fastest solution.
If you only need to do this once, yes, this wins. OTOH, if you have
to do this sort even twice, my method is better.
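As a back-of-the-envelope check, counting full passes over the data
(a toy model of my own: it assumes the sorted copy can be streamed to
the client while it is being written, and it ignores the extra disk
space the copy needs and any read/write speed difference):

```python
def two_pass_cost(queries, distinct_values=2):
    # the two-seq-scan method: one full scan per distinct value,
    # repeated from scratch for every query
    return distinct_values * queries

def clustered_cost(queries):
    # first query: one read scan plus one write scan, keeping the
    # sorted copy; each later query rereads that copy in one scan
    if queries == 0:
        return 0
    return 2 + (queries - 1)

# one query: a wash in scan count (two-pass also avoids the copy);
# from the second query on, the reusable sorted copy pulls ahead
assert two_pass_cost(1) == clustered_cost(1) == 2
assert two_pass_cost(2) == 4 and clustered_cost(2) == 3
```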