> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-[EMAIL PROTECTED]] On Behalf Of Tom Lane
> Sent: Friday, September 30, 2005 11:02 PM
> To: Jeffrey W. Baker
> Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql-[EMAIL PROTECTED]; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> "Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M. Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
> > over the tape. It could be done in a single-pass heap merge with
> > N*log(M) comparisons and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.
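For illustration only (Python rather than the tuplesort.c internals), the single-pass merge described above can be sketched like this: each sorted run contributes its current head to a heap, so merging N total tuples from M runs costs O(N * log M) comparisons and reads each tuple from disk exactly once.

```python
import heapq

def heap_merge(runs):
    """Merge any number of sorted runs in a single pass using a heap
    keyed on each run's current head element."""
    # Seed the heap with (head value, run index, position-within-run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        # Refill from the run we just consumed, if it has more elements.
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

# 13 runs merged in one pass, instead of multiple 6-way merge passes.
runs = [sorted(range(k, 100, 13)) for k in range(13)]
print(heap_merge(runs) == sorted(range(100)))  # True
```

Only the head of each run needs to be in memory at any moment, which is why the merge order can grow so far beyond six before memory becomes the constraint.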
I believe I made the exact same suggestion several days ago.

> The reason the current code uses a six-way merge is that Knuth's
> figure 70 (p. 273 of volume 3, first edition) shows that there's not
> much incremental gain from using more tapes ... if you are in the
> regime where the number of runs is much greater than the number of
> tape drives. But if you can stay in the regime where only one merge
> pass is needed, that is obviously a win.
>
> I don't believe we can simply legislate that there be only one merge
> pass. That would mean that, if we end up with N runs after the
> initial run-forming phase, we need to fit N tuples in memory --- no
> matter how large N is, or how small work_mem is. But it seems like a
> good idea to try to use an N-way merge where N is as large as
> work_mem will allow. We'd not have to decide on the value of N until
> after we've completed the run-forming phase, at which time we've
> already seen every tuple once, and so we can compute a safe value for
> N as work_mem divided by largest_tuple_size. (Tape I/O buffers would
> have to be counted too, of course.)

You only need to hold the sort column(s) in memory, except for the
queue you are exhausting at the time. [And of those columns, only the
values for the smallest one in a sub-list.] Of course, the more data
from each list that you can hold at once, the fewer the disk reads
and seeks.

Another idea (not sure if it is pertinent): instead of having a fixed
size for the sort buffers, size them to the query. Given a total pool
of size M, give each sort a percentage according to the difficulty of
the work to perform. So a query with 3 small columns and a
cardinality of 1000 gets a small percentage, and a query with 10 GB
of data gets a big percentage of the available sort memory.

> It's been a good while since I looked at the sort code, and so I
> don't recall if there are any fundamental reasons for having a
> compile-time-constant value of the merge order rather than choosing
> it at runtime.
> My guess is that any inefficiencies added by making it variable would
> be well repaid by the potential savings in I/O.
>
> 			regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
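To make the runtime choice concrete, here is a back-of-the-envelope sketch (the function name and parameters are my own, not PostgreSQL's) of picking the merge order N after run forming, the way Tom describes: divide work_mem by the worst-case per-tape cost (largest tuple plus tape I/O buffer), and cap at the number of runs actually produced.

```python
def choose_merge_order(work_mem, largest_tuple_size, tape_buffer_size, num_runs):
    """Largest safe N-way merge order: how many tapes can each keep
    one worst-case tuple plus an I/O buffer resident in work_mem."""
    per_tape = largest_tuple_size + tape_buffer_size
    n = work_mem // per_tape
    # Never drop below a 2-way merge, never exceed the run count.
    return max(2, min(n, num_runs))

# 1 GB work_mem, 1 kB worst-case tuples, 256 kB tape buffers, 60 runs:
# all 60 runs can be merged in a single pass.
print(choose_merge_order(1 << 30, 1024, 256 * 1024, 60))  # 60
```

Since largest_tuple_size is only known after the run-forming phase has seen every tuple, this computation naturally happens at the point where the merge order must be decided anyway.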