On 04/07/2024 3:45 pm, Yao Wang wrote:
Generally, the benefit of mksort is mainly from duplicated values and sort keys: the more duplicated values and sort keys are, the bigger benefit it gets.
...
1. Use distinct stats info of table to enable mksort It's kind of heuristics: in optimizer, check Form_pg_statistic->stadistinct of a table via pg_statistics. Enable mksort only when it is less than a threshold. The hacked code works, which need to modify a couple of interfaces of optimizer. In addition, a complete solution should consider types and distinct values of all columns, which might be too complex, and the benefit seems not so big.
If mksort really provides advantage only when there are a lot of duplicates (for prefix keys?) and of small fraction of duplicates there is even some (small) regression then IMHO taking in account in planner information about estimated number of distinct values seems to be really important. What was a problem with accessing this statistics and why it requires modification of optimizer interfaces? There is `get_variable_numdistinct` function which is defined and used only in selfuncs.c
Information about values distribution seems to be quite useful for choosing optimal sort algorithm. Not only for multi-key sort optimization. For example if we know min.max value of sort key and it is small, we can use O(N) algorithm for sorting. Also it can help to estimate when TOP-N search is preferable.
Right now Posgres creates special path for incremental sort. I am not sure if we also need to be separate path for mk-sort. But IMHO if we need to change some optimizer interfaces to be able to take in account statistic and choose preferred sort algorithm at planning time, then it should be done. If mksort can increase sort more than two times (for large number of duplicates), it will be nice to take it in account when choosing optimal plan.
Also in this case we do not need extra GUC for explicit enabling of mksort. There are too many parameters for optimizer and adding one more will make tuning more complex. So I prefer that decision is take buy optimizer itself based on the available information, especially if criteria seems to be obvious.
Best regards, Konstantin