> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Friday, September 30, 2005 11:02 PM
> To: Jeffrey W. Baker
> Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql-
> [EMAIL PROTECTED]; email@example.com
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> "Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M is. Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
> > over the tape. It could be done in a single-pass heap merge with far
> > fewer comparisons, and, more importantly, far less input and output.
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.
I believe I made the exact same suggestion several days ago.
> The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3, first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where the number of runs is much
> greater than the number of tape drives. But if you can stay in the
> regime where only one merge pass is needed, that is obviously a win.
> I don't believe we can simply legislate that there be only one merge
> pass. That would mean that, if we end up with N runs after the
> run-forming phase, we need to fit N tuples in memory --- no matter how
> large N is, or how small work_mem is. But it seems like a good idea to
> try to use an N-way merge where N is as large as work_mem will allow.
> We'd not have to decide on the value of N until after we've completed
> the run-forming phase, at which time we've already seen every tuple
> once, and so we can compute a safe value for N as work_mem divided by
> largest_tuple_size. (Tape I/O buffers would have to be counted too
> of course.)
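The rule above (pick N after run forming, bounded by work_mem divided by the largest tuple size, counting tape I/O buffers) can be sketched as follows. This is an illustrative sketch, not the tuplesort.c implementation; the function name and parameters are my own:

```python
def choose_merge_order(work_mem, largest_tuple_size, tape_buffer_size, num_runs):
    """Pick a safe N-way merge order at runtime.

    Each input tape must be able to hold its current tuple plus an I/O
    buffer, so N is bounded by work_mem / (largest_tuple_size + buffer).
    """
    per_tape = largest_tuple_size + tape_buffer_size
    n = work_mem // per_tape
    # No point merging more tapes than we have runs; always merge at least two.
    return max(2, min(n, num_runs))

# With 1 MB of work_mem, 100-byte tuples, and 8 kB tape buffers, memory
# admits a 126-way merge, so 50 runs merge in a single pass.
print(choose_merge_order(1024 * 1024, 100, 8192, 50))
```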
You only need to hold the sort column(s) in memory, except for the queue
you are exhausting at the time. [And of those columns, only the smallest
value in each sub-list.] Of course, the more data from each list that
you can hold at once, the fewer the disk reads and seeks.
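The single-pass merge being discussed only needs the head of each sorted run resident at once. A minimal sketch of that N-way heap merge (runs stand in for tapes; the function name is illustrative):

```python
import heapq

def kway_merge(runs):
    """Merge N sorted runs in one pass, keeping only one element per run
    in memory at a time (plus the run's position in the heap)."""
    heap = []
    for i, run in enumerate(runs):
        it = iter(run)
        head = next(it, None)
        if head is not None:
            # The index i breaks ties so iterators are never compared.
            heapq.heappush(heap, (head, i, it))
    while heap:
        val, i, it = heapq.heappop(heap)
        yield val
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))

print(list(kway_merge([[1, 4, 7], [2, 5], [3, 6]])))
```

Each element is read and written exactly once, versus once per pass in a multiphase merge.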
Another idea (not sure whether it is pertinent):
Instead of using a fixed size for the sort buffers, size them to the
query. Given a total pool of size M, give each sort a percentage
according to the difficulty of the work it has to perform. So a query
with 3 small columns and a cardinality of 1000 gets a small percentage,
and a query with 10 GB of data gets a big percentage of the available
sort memory.
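One way the proportional split could look, as a toy sketch (the function, the size-based difficulty measure, and the 64 kB floor are all my own assumptions, not anything PostgreSQL does):

```python
def allocate_sort_mem(total_pool, estimated_input_sizes):
    """Split a memory pool M across concurrent sorts in proportion to
    each sort's estimated input size, with a floor so that small sorts
    still get a usable buffer."""
    floor = 64 * 1024  # assumed minimum per sort
    total = sum(estimated_input_sizes)
    return [max(floor, total_pool * s // total) for s in estimated_input_sizes]

# A 10 MB pool shared by a small sort and one nine times larger:
print(allocate_sort_mem(10_000_000, [1000, 9000]))
```

A real allocator would also have to cap the sum at the pool size once floors kick in; that bookkeeping is omitted here.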
> It's been a good while since I looked at the sort code, and so I don't
> recall if there are any fundamental reasons for having a compile-time-
> constant value of the merge order rather than choosing it at runtime.
> My guess is that any inefficiencies added by making it variable would
> be well repaid by the potential savings in I/O.
> regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend