Getting back to improvements for small sort runs, it might make sense to 
consider using in-register based sorting via sorting networks for very small 
runs.

This talk is specific to database sorting and illustrates how such an approach 
can be vectorized: 
https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090

It looks like some of the commercial DBMSs do this very successfully. They use 
4 512bit registers (AVX-512) in this example, which could technically store 4 * 
4 sort-elements (given that the sorting key is 64 bit and the tuple pointer is 
64 bit). I wonder whether this could also be done with just SSE (instead of 
AVX), which the project now readily supports thanks to your recent efforts?


Reply via email to