Getting back to improvements for small sort runs, it might make sense to consider using in-register based sorting via sorting networks for very small runs.
This talk is specific to database sorting and illustrates how such an approach can be vectorized: https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090 It looks like some of the commercial DBMSs do this very successfully. They use 4 512bit registers (AVX-512) in this example, which could technically store 4 * 4 sort-elements (given that the sorting key is 64 bit and the tuple pointer is 64 bit). I wonder whether this could also be done with just SSE (instead of AVX), which the project now readily supports thanks to your recent efforts?