On Mon, Dec 21, 2009 at 11:43 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>> If we don't find a way to optimize this for on-disk tuplestores though
>> then I wonder whether it's worth it. It would only help in the narrow
>> case that you had a large enough set to see the difference between
>> doing quickselect and quicksort but small enough to fit in workmem. I
>> suppose that case is widening over time though as memory sizes get
>> bigger and bigger.
>
> Thank you
>
> I have to do same test and get more info about quickselect
So my first thought of how to do median efficiently on large on-disk data structures is to do a first pass that finds as narrow a target partition as possible that the median definitely falls in. Then do a second pass counting how many tuples fall before and after the target partition, accumulating all the tuples in the range, and perform quickselect to find the right tuple within the target partition.

To find a target partition you could use quickselect itself to find a set of medians, but I think it boils down to a kind of Huffman-like tree encoding. You want to accumulate values as degenerate singleton partitions until they don't fit in work_mem. When you exhaust work_mem you collapse two adjacent "partitions" into a single partition covering the whole range with a count of 2. You keep going, collapsing the two lowest-count adjacent ranges each time (perhaps preferring to collapse ranges far from the current median?). When you're done you can easily determine which range contains the median. You then scan the whole original input again, skipping any tuple that falls outside that partition. If what's left still doesn't fit in work_mem you repeat the algorithm.

This looks like a lot of code for a narrow use case though. And it does a minimum of two passes through the input, which means it will perform similarly to our regular tape sort until the inputs get large enough that tapesort needs to do multiple passes. At that point it's possible this algorithm might need multiple passes as well, though hopefully over smaller and smaller sets. I thought I would put it down in an email in case anyone's interested in experimenting though.

Also, I assume I've just reinvented some standard algorithm I should remember from school, though I can't tell which offhand. It looks kind of like mergesort, except since we're just doing selection we don't need to actually do the merge step, just keep track of which values we would have merged.
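To make the range-collapsing first pass concrete, here is a rough Python sketch (purely illustrative, not PostgreSQL code: `max_ranges` is my stand-in for the work_mem budget, and the "second pass" just filters an in-memory list instead of rescanning a tape; if the surviving range were still too big you would repeat the whole thing on it):

```python
import bisect

def build_ranges(stream, max_ranges):
    """First pass: summarize the input into at most max_ranges disjoint,
    ordered ranges [lo, hi, count].  Values start as singleton ranges;
    when the budget is exceeded, the adjacent pair with the smallest
    combined count is collapsed into one wider range."""
    ranges = []  # sorted by lower bound, mutually disjoint
    for v in stream:
        # Does v fall inside an existing range?  Then just bump its count.
        i = bisect.bisect_right([r[0] for r in ranges], v) - 1
        if i >= 0 and ranges[i][0] <= v <= ranges[i][1]:
            ranges[i][2] += 1
            continue
        bisect.insort(ranges, [v, v, 1])  # new singleton "partition"
        while len(ranges) > max_ranges:
            # Collapse the adjacent pair with the lowest combined count.
            j = min(range(len(ranges) - 1),
                    key=lambda j: ranges[j][2] + ranges[j + 1][2])
            a, b = ranges[j], ranges[j + 1]
            ranges[j:j + 2] = [[a[0], b[1], a[2] + b[2]]]
    return ranges

def two_pass_median(data, max_ranges):
    """Locate the range holding the median from the exact per-range
    counts, then rescan keeping only tuples inside that range and
    select the right one within it."""
    k = (len(data) - 1) // 2          # 0-based index of the (lower) median
    seen = 0
    for lo, hi, count in build_ranges(data, max_ranges):
        if seen + count > k:
            # "Second pass": keep only values inside the target range.
            candidates = sorted(v for v in data if lo <= v <= hi)
            return candidates[k - seen]
        seen += count
```

Note that the counts stay exact no matter which pairs get collapsed, so the answer doesn't depend on the collapse policy; the policy only affects how narrow the surviving range is, i.e. how much work the second pass has to do.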
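And for the final in-memory step, quickselect itself is the standard trick: partition around a pivot as quicksort does, but recurse into only the side that contains the index you're after, giving expected linear time. A minimal sketch (again illustrative Python, names are mine):

```python
import random

def quickselect(vals, k):
    """Return the k-th smallest element (0-based) of vals.
    Unlike quicksort, we recurse into only the side of the pivot
    that contains index k, so the expected cost is O(n)."""
    pivot = random.choice(vals)
    less = [v for v in vals if v < pivot]
    equal = [v for v in vals if v == pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect([v for v in vals if v > pivot],
                       k - len(less) - len(equal))

def median(vals):
    # Lower median for even-length inputs.
    return quickselect(vals, (len(vals) - 1) // 2)
```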
--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers