> convenient if you know the type at compile time. See the attached,
> which I had laying around when I was looking at PDQuicksort. I believe
> it's similar to what you have in mind.

That looks very promising.
I also love your recent proposal of partitioning into null and non-null. I 
suspect that to be a clear winner.

> I think it belongs around 10 now anyway.

Yeah, I think that change is overdue given modern hardware characteristics 
(even with the current implementation).

Another idea could be to run a binary insertion sort and use a much higher 
threshold. This could significantly cut down on comparisons (especially with 
presorted runs, which are quite common in real workloads).
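To sketch the idea (Python, purely illustrative; the function name is mine, and I'm leaning on bisect_right for the search, which also keeps the sort stable):

```python
# Binary insertion sort: find the insertion point in the sorted prefix
# with a binary search (O(log i) comparisons) instead of a linear scan.
from bisect import bisect_right

def binary_insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]
        # Binary search for the insertion position in the sorted prefix a[:i].
        pos = bisect_right(a, key, 0, i)
        # Shift the tail up by one slot and drop the key into place.
        a[pos + 1:i + 1] = a[pos:i]
        a[pos] = key
    return a
```

On a presorted run this does one comparison-free bisect per element plus no shifting, which is where the comparison savings would come from.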

If a full binary search turned out to hurt cache locality, we could do it in 
smaller chunks: whenever A[I] < A[I-1], run a micro binary search between the 
current position I and I minus the chunk size (say 6-12 elements, whatever 
fits in 1 or 2 cache lines), and if the insertion position isn't within that 
chunk, continue with the previous chunk, thereby preserving cache locality.

With fewer comparisons we should start keeping track of swaps and use that as 
an efficient way to determine presortedness. We could change the insertion 
sort threshold into a swap threshold and run insertion sort until we hit it. I 
suspect that would make the current presorted check obsolete, as binary 
insertion sort with no (or even a few) swaps should be faster than the current 
presorted check.
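Something along these lines (plain insertion sort for simplicity; the budget value and the name are hypothetical, and the point is only the bail-out behaviour):

```python
def insertion_sort_with_swap_budget(a, budget=32):
    """Sketch: insertion sort that gives up once the number of element
    moves exceeds `budget`. Returns True if the list ends up fully
    sorted, False if the budget ran out (and the caller should fall
    back to the general sort). `budget` is a made-up tuning knob."""
    moves = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            moves += 1
            if moves > budget:
                a[j + 1] = key  # restore a valid permutation before bailing
                return False
        a[j + 1] = key
    return True
```

A presorted or nearly presorted input finishes well under the budget, so the "is it presorted?" question gets answered as a side effect of doing useful work instead of by a separate scan.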

Cheers, Ben

