On Sat, Mar 3, 2012 at 4:15 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> The better solution would be to reduce the overhead in the first
> place.  While building the initial runs, there is no reason to have 3
> blocks worth of overhead for each tape, when only one tape is ever
> being used at a time.  But that change seems much tougher to
> implement.
Hmm, yeah.

> Anyway, I think the logtape could use redoing.  When your tapes are
> actually physically tape drives, it is necessary to build up runs one
> after the other on physical tapes, because un-mounting a tape from a
> tape drive and remounting another tape is not very feasible at scale.
> Which then means you need to place your runs carefully, because if the
> final merge finds that two runs it needs are back-to-back on one tape,
> that is bad.  But with disks pretending to be tapes, you could
> re-arrange the "runs" with just some book keeping.  Maintaining the
> distinction between "tapes" and "runs" is pointless, which means the
> Fibonacci placement algorithm is pointless as well.

I think you're right.  It seems to me that we could just write out an
arbitrary number of runs, one per file, ending up with files numbered
1..N.  If we can do a final merge of N runs without overrunning
work_mem, fine.  If not, we merge the first K runs (for the largest
possible K) and write out a new run N+1.  The range of valid run
numbers is now K+1..N+1.  If those can be merged in a single pass, we
do so; otherwise we again merge the first K runs (K+1 through 2K) to
create a new run N+2.  And so on.  I am not clear, however, whether
this would be any faster.  It may not be worth tinkering with just for
the reduction in code complexity.

> But it would mean we have about 1.7x more runs that need to be merged
> (for initially random data).  Considering the minimum merge order is
> 6, that increase in runs is likely not to lead to an additional level
> of merging, in which case the extra speed of building the runs would
> definitely win.  But if it does cause an additional level of merge, it
> could end up being a loss.

That's true, but the real hit to the run size should be quite a bit
less than 1.7x, because we'd also be using memory more efficiently,
and therefore more tuples should fit.
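The run-merging scheme sketched above (write out runs one per file,
then repeatedly fold the first K runs into a new run appended at the
end until a single pass suffices) can be illustrated like this; the
function name, the choice of k, and the sorted-list stand-ins for
on-disk run files are illustrative assumptions, not tuplesort's actual
mechanics:

```python
import heapq

def merge_runs(runs, k):
    """Repeatedly merge the first k runs into a new run appended at the
    end, until the survivors can be merged in a single final pass.
    Each run is modeled as a sorted Python list standing in for a file."""
    runs = list(runs)
    while len(runs) > k:
        new_run = list(heapq.merge(*runs[:k]))  # merge runs 1..K
        runs = runs[k:] + [new_run]             # valid runs are now K+1..N+1
    return list(heapq.merge(*runs))             # final single-pass merge

# Five initial runs with merge order k = 2: two intermediate passes,
# then a final merge of the survivors.
runs = [[1, 5], [2, 6], [3, 7], [4, 8], [0, 9]]
print(merge_runs(runs, 2))  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```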
A little debugging code suggests that on my MacBook Pro, things come
out like this:

- Allocations of 8 bytes or less use 24 bytes.
- Allocations of 9-16 bytes use 32 bytes.
- Allocations of 17-32 bytes use 48 bytes.
- Allocations of 33-64 bytes use 80 bytes.
- Allocations of 65-128 bytes use 144 bytes.
- Allocations of 129-256 bytes use 272 bytes.

In other words, the palloc overhead seems to be: round up to the next
power of two, and add 16 bytes.  This means that if, say, all the rows
are exactly 100 bytes, we're using 44% more memory than we would under
this scheme, which means that the runs would be 44% longer than
otherwise.  Of course that's just an example - if you are sorting
132-byte rows, you'll be able to squeeze in 106% more of them, and if
you're sorting 128-byte rows, you'll only be able to squeeze in 12.5%
more of them.  So there are certainly cases where the runs would get
shorter, especially if you happened to have mostly-sorted data rather
than randomly-ordered data, but not in every case.  Still, it may be a
terrible idea; I only mention it because I did find a lot of
references to people using quicksort to build up the initial runs
rather than our heapsort-based approach.

> Is there some broad corpus of sorting benchmarks which changes could
> be tested against?  I usually end up testing just simple columns of
> integers or small strings, because they are easy to set up.  That is
> not ideal.

I don't know of anything more formal, unfortunately.

>> As a desirable side effect, I think it would mean
>> that we could dispense with retail palloc and pfree altogether.  We
>> could just allocate a big chunk of memory, copy tuples into it until
>> it's full, using a pointer to keep track of the next unused byte, and
>> then, after writing the run, reset the allocation pointer back to the
>> beginning of the buffer.
>> That would not only avoid the cost of going
>> through palloc/pfree, but also the memory overhead imposed by
>> bookkeeping and power-of-two rounding.
>
> Wouldn't we still need an array of pointers to the start of every
> tuple's location in the buffer?  Or else, how would qsort know where
> to find them?

Yes, we'd still need that.  I don't see any way to get around that; I
just don't like the expense of palloc-ing so many little chunks in
addition to that array.

> Also, to do this we would need to get around the 1GB allocation limit.
> It is bad enough that memtuples is limited to 1GB, it would be much
> worse if the entire arena was limited to that amount.

Well, we could always allocate multiple 1GB chunks and peel off pieces
from each one in turn.

>> If we do want to stick with the current algorithm, there seem to be
>> some newer techniques for cutting down on the heap maintenance
>> overhead.  Heikki's been investigating that a bit.
>
> Interesting.  Is that investigation around the poor L1/L2 caching
> properties of large heaps?  I was wondering if there might be a way to
> give tuplesort an initial estimate of how much data there was to sort,
> so that it could use a smaller amount of memory than the max if it
> decided that that would lead to better caching effects.  Once you know
> you can't do an in-memory sort, then there is no reason to use more
> memory than the amount that lets you merge all the runs in one go.

No, it's about reducing the number of comparisons needed to maintain
the heap property.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
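P.S. The arithmetic and the bump-pointer arena discussed in this
thread can be sketched as follows.  `palloc_footprint` encodes the
rule measured above (not a guaranteed property of palloc on every
platform), and `SortArena`, its (offset, length) bookkeeping, and the
whole-tuple comparison key are all illustrative assumptions rather
than a proposed implementation:

```python
def palloc_footprint(request):
    """Observed per-allocation cost: round the request up to the next
    power of two (minimum 8 bytes), then add 16 bytes of overhead."""
    size = 8
    while size < request:
        size *= 2
    return size + 16

# palloc_footprint(100) -> 144: the 44% overhead quoted above.

class SortArena:
    """Bump-pointer arena along the lines discussed: copy tuples into
    one big buffer, track each tuple's start in an offset array (which
    qsort still needs), and rewind the pointer after each run."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.next_free = 0    # bump pointer: next unused byte
        self.offsets = []     # per-tuple (offset, length), for sorting

    def add(self, payload):
        if self.next_free + len(payload) > len(self.buf):
            return False      # arena full: write the run out, then reset
        self.offsets.append((self.next_free, len(payload)))
        self.buf[self.next_free:self.next_free + len(payload)] = payload
        self.next_free += len(payload)
        return True

    def sorted_run(self):
        # Sort the offset array, not the tuples themselves.
        return [bytes(self.buf[o:o + n])
                for o, n in sorted(self.offsets,
                                   key=lambda p: self.buf[p[0]:p[0] + p[1]])]

    def reset(self):
        # After writing the run, just rewind the bump pointer to
        # recycle the whole buffer for the next run.
        self.next_free = 0
        self.offsets.clear()
```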