Just to add a little anarchy to your nice debate...

        Who really needs all the results of a sort on your terabyte table?

I guess not many people do a SELECT from such a table and want all the results. So, this leaves:
        - Really wanting all the results, to fetch using a cursor,
        - CLUSTER type things, where you really want everything in order,
        - Aggregates (Sort -> GroupAggregate), which might really need to sort the whole table,
        - Complex queries where the whole dataset needs to be examined in order to return a few values,
        - Joins (again, the whole table is probably not going to be selected)
        - And the ones I forgot.

        However,

        Most likely you only want to SELECT N rows, in some ordering (a small sketch of the idea follows this list):
        - the first N (ORDER BY x LIMIT N)
        - last N (ORDER BY x DESC LIMIT N)
        - WHERE x>value ORDER BY x LIMIT N
        - WHERE x<value ORDER BY x DESC LIMIT N
        - and other variants
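
A minimal Python sketch to make the point concrete (not backend code, just the idea): an ORDER BY x LIMIT N over a huge input can be answered with a bounded heap, so memory stays proportional to N instead of to the table size.

    import heapq

    def top_n(rows, n, key):
        # heapq.nsmallest keeps an internal heap of at most n items,
        # so the whole input stream is never materialised in memory.
        return heapq.nsmallest(n, rows, key=key)

    # "SELECT ... ORDER BY x LIMIT 3" over a stream of (x, payload) rows:
    stream = ((x, "row %d" % x) for x in (42, 7, 99, 1, 13, 70, 5))
    print(top_n(stream, n=3, key=lambda r: r[0]))
    # -> [(1, 'row 1'), (5, 'row 5'), (7, 'row 7')]

The DESC variants are just heapq.nlargest, and the WHERE x>value / x<value forms simply filter the stream before it reaches the heap.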

Or, you are doing a merge join against some other table; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, so maybe we can derive some conditions or limit values on the column being sorted.

Also, the new, optimized hash join, which is more memory-efficient, might cover this case.
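
Either way, the useful bit of information is the range of join keys the other (smaller, already hashed or sorted) side actually contains. A hand-wavy Python sketch, with names I made up (this is not existing executor code): rows of the big table whose keys fall outside that range can be dropped before they ever reach the sort.

    def sort_within_bounds(rows, key, lo=None, hi=None):
        # Hypothetical helper: discard rows whose sort key cannot possibly
        # match the other side of the join, *before* sorting.
        kept = [r for r in rows
                if (lo is None or key(r) >= lo) and (hi is None or key(r) <= hi)]
        kept.sort(key=key)
        return kept

    # Suppose the small side of the merge join only has join keys 100..200:
    big_table = ((k, "payload") for k in range(1, 1000000))
    print(len(sort_within_bounds(big_table, key=lambda r: r[0], lo=100, hi=200)))
    # -> 101 rows actually sorted, instead of 999999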

Point is, sometimes you only need part of the results of your sort. And the bigger the sort, the more likely it becomes that you only want part of the results. So, while we're in fun hand-waving, new-algorithm-trying mode, why not consider this right from the start? (I know I'm totally in hand-waving mode right now, so slap me if needed.)

        I'd say your new, fancy sort algorithm needs a few more input values:

        - Range of values that must appear in the final result of the sort: none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted,
        - LIMIT information (first N, last N, none)
        - Enhanced LIMIT information (first/last N values of the second sort column, for each value of the first column): the infamous "top 10 by category" query (sketched after this list),
        - etc.
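
For the "top 10 by category" case, here is a sketch of how a sort could use that enhanced limit (again Python pseudocode under names I invented, not a proposal for the actual executor interface): one bounded heap per category, so memory is O(categories * N) rather than O(rows).

    import heapq
    from collections import defaultdict

    def top_n_per_category(rows, n, category, key):
        # One bounded max-heap (via negated keys) per category; the heap
        # root is the worst row kept so far for that category and gets
        # evicted when a better row arrives.
        heaps = defaultdict(list)
        for row in rows:
            h = heaps[category(row)]
            if len(h) < n:
                heapq.heappush(h, (-key(row), row))
            elif -key(row) > h[0][0]:
                heapq.heapreplace(h, (-key(row), row))
        return {c: sorted((r for _, r in h), key=key) for c, h in heaps.items()}

    rows = [("books", 12.0), ("books", 3.5), ("toys", 9.9),
            ("books", 7.0), ("toys", 1.2)]
    print(top_n_per_category(rows, n=2,
                             category=lambda r: r[0], key=lambda r: r[1]))
    # -> {'books': [('books', 3.5), ('books', 7.0)],
    #     'toys': [('toys', 1.2), ('toys', 9.9)]}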

With this, the amount of data that needs to be kept in memory is dramatically reduced: from the whole table (even using your compressed keys, that's big) down to something more manageable, much closer to the size of the final result set that will be returned to the client, and a lot of effort is avoided.

So, this would not apply in all cases, but when it does, it would be really useful.

        Regards!
