On Wed, Mar 7, 2012 at 11:55 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.ja...@gmail.com> wrote: > >> Anyway, I think the logtape could use redoing. When your tapes are >> actually physically tape drives, it is necessary to build up runs one >> after the other on physical tapes, because un-mounting a tape from a >> tape drive and remounting another tape is not very feasible at scale. >> Which then means you need to place your runs carefully, because if the >> final merge finds that two runs it needs are back-to-back on one tape, >> that is bad. But with disks pretending to be tapes, you could >> re-arrange the "runs" with just some book keeping. Maintaining the >> distinction between "tapes" and "runs" is pointless, which means the >> Fibonacci placement algorithm is pointless as well. > > I think you're right. It seems to me that we could just write out an > arbitrary number of runs, one per file, ending up with files number > 1..N.
The problem there is that none of the files can be deleted until it was entirely read, so you end up with all the data on disk twice. I don't know how often people run their databases so close to the edge on disk space that this matters, but someone felt that that extra storage was worth avoiding. We could still get rid of the run/tape distinction but keep the block recycling method--they are conceptually distinct things. (But hopefully improve the block recycling so the locality doesn't degrade as much as it seems to now) > If we can do a final merge of N runs without overrunning > work_mem, fine. If not, we merge the first K runs (for the largest > possible K) and write out a new run N+1. The range of valid run > number is now K+1..N+1. If those can be merged in a single pass, we > do so; otherwise we again merge the first K runs (K+1 through 2K) to > create a new run N+2. And so on. > > I am not clear, however, whether this would be any faster. It may not > be worth tinkering with just for the reduction in code complexity. Yeah, I was thinking it would only be worth redoing if the current implementation interferes with other improvements--which I think is likely, but don't have any concrete ideas yet where it would. ... >>> As a desirable side effect, I think it would mean >>> that we could dispense with retail palloc and pfree altogether. We >>> could just allocate a big chunk of memory, copy tuples into it until >>> it's full, using a pointer to keep track of the next unused byte, and >>> then, after writing the run, reset the allocation pointer back to the >>> beginning of the buffer. That would not only avoid the cost of going >>> through palloc/pfree, but also the memory overhead imposed by >>> bookkeeping and power-of-two rounding. >> >> Wouldn't we still need an array of pointers to the start of every >> tuple's location in the buffer? Or else, how would qsort know where >> to find them? > > Yes, we'd still need that. I don't see any way to get around that; I > just don't like the expense of palloc-ing so many little chunks in > addition to that array. Right, OK, I had thought you meant the power of two rounding of mem_tuples, I overlooked the power of two rounding of palloc. > >> Also, to do this we would need to get around the 1GB allocation limit. >> It is bad enough that memtuples is limited to 1GB, it would be much >> worse if the entire arena was limited to that amount. > > Well, we could always allocate multiple 1GB chunks and peel off pieces > from each one in turn. I thought of that with mem_tuples itself, but I think it would be more difficult to do that than just fixing the allocation limit would be. Using multiple chunks might be easier with the buffer space as you don't need to do heap arithmetic on those. >>> If we do want to stick with the current algorithm, there seem to be >>> some newer techniques for cutting down on the heap maintenance >>> overhead. Heikki's been investigating that a bit. >> >> Interesting. Is that investigation around the poor L1/L2 caching >> properties of large heaps? I was wondering if there might be a way to >> give tuplesort an initial estimate of how much data there was to sort, >> so that it could use a smaller amount of memory than the max if it >> decided that that would lead to better caching effects. Once you know >> you can't do an in-memory sort, then there is no reason to use more >> memory than the amount that lets you merge all the runs in one go. > > No, it's about reducing the number of comparisons needed to maintain > the heap property. That sounds very interesting. I didn't know it was even theoretically possible to do that. Cheers, Jeff -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers