Well, if you can get rid of radix sort, do it. It's never going to be faster than it is on random data, and quicksort and mergesort should never be slower, as long as your pivot selection's good.
I'm very skeptical of Roger's 16-bit radix sorting though: it seems slower than 8-bit if you know all the tricks, and it's far less reliable because of the huge cache footprint. An 8-bit radix sort needs to do all the counts in one pass and either vectorize or interleave the sums (and C compilers won't vectorize a prefix sum for you, shame). ska_sort_copy is where I learned this. Here's the 4-byte one:

https://github.com/skarupke/ska_sort/blob/master/ska_sort.hpp#L212-L263

This easily beats any comparison sort I've seen on >100 random ints, while of course not being adaptive at all. For 8 bytes it's going to struggle.

From a glance I'd expect your quicksort to be faster than the mergesort on anything with a reasonable AVX2 implementation. I don't know why either would change that much based on CPU, though. Maybe the branches in mergesort have gotten relatively slower as other instructions sped up? Quicksort has only the one branch per 64 comparisons, it looks like.

Then the major looming issue is patterned data: many sorts have some patterns they handle really well, and ideally you'd like to get all of them. Hybridization is hard work. I discuss this a lot in my sorting notes and have been trying to assemble the pieces in SingeliSort.

https://mlochbaum.github.io/BQN/implementation/primitive/sort.html
https://github.com/mlochbaum/SingeliSort

Just out yesterday, glidesort has a method that I would describe as using mergesort as a flexible outer layer over quicksort. The mergesort only applies if there are natural runs.

https://news.ycombinator.com/item?id=34646199

Marshall

On Sat, Feb 04, 2023 at 10:49:26AM -0500, Henry Rich wrote:
> This is so hard to measure! What's a reasonable reference system?
>
> The last time I measured, radixsort and mergesort had sizes where they were
> better. Now, on my Alder Lake CPU, quicksort beats the lot, at all sizes,
> integer and float. I don't see why it changed so much.
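[Editorial sketch, not part of the thread: the single-pass-count 8-bit radix approach described above — gather digit counts for every byte position in one pass over the input, then one stable scatter pass per byte — shown in Python for clarity. ska_sort_copy's real implementation is C++ and vectorizes the prefix sums; nothing here reflects the J interpreter's code.]

```python
def radix_sort_u32(a):
    """LSD radix sort of unsigned 32-bit ints using 8-bit digits.
    All four count tables are filled in a single pass over the input
    (the trick mentioned above); each scatter pass then needs only a
    prefix sum over its 256 counts plus one sweep of the data."""
    counts = [[0] * 256 for _ in range(4)]
    for x in a:                        # one counting pass for all digits
        counts[0][x & 0xFF] += 1
        counts[1][(x >> 8) & 0xFF] += 1
        counts[2][(x >> 16) & 0xFF] += 1
        counts[3][(x >> 24) & 0xFF] += 1
    src, dst = list(a), [0] * len(a)
    for b in range(4):
        offs, total = [0] * 256, 0     # exclusive prefix sum -> bucket starts
        for d in range(256):
            offs[d], total = total, total + counts[b][d]
        for x in src:                  # stable scatter by digit b
            d = (x >> (8 * b)) & 0xFF
            dst[offs[d]] = x
            offs[d] += 1
        src, dst = dst, src
    return src
```

Being LSD and stable, the four passes leave the data fully sorted; signed inputs or adaptivity would need extra handling.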
>
> There is still a sizable niche for small-range integer sort at all lengths,
> but the range decreases with length.
>
> Henry Rich
>
> On 2/3/2023 9:59 PM, Marshall Lochbaum wrote:
> > In case it helps, I took the following measurement with bencharray. Looks
> > like the parts other than those plateaus in the middle have gotten faster
> > since 9.03, although they were still steps up before. The input ranges
> > from -2^31 to 1-~2^31.
> >
> > https://raw.githubusercontent.com/gist/mlochbaum/40952f8d28ef745dc2d83d8278908ca8/raw/sort-rand-i32-x-j.svg
> >
> > In the key, ∧ is /:~ and ⍋ is /: . Same for ∨ ⍒ in the other direction.
> >
> > Reproduce with https://github.com/mlochbaum/bencharray and the following
> > command; requires Unix and a BQN install and takes a minute or two to
> > run. Output in output/plot/sort-rand-i32-x-j.svg .
> >
> >    $ ./benchmark.bqn all j sort-rand-i32
> >
> > Marshall
> >
> > On Fri, Feb 03, 2023 at 06:08:32PM -0500, Henry Rich wrote:
> > > Integer sort chooses between 4 algorithms based on timings we make
> > > every now and then. It's been a few years since we tuned it.
> > >
> > > 100000 integers or more uses mergesort; 50000-99999 uses radixsort.
> > >
> > > It looks like I'd better retune that. Thanks for the tip.
> > >
> > > Henry Rich
> > >
> > > On 2/3/2023 5:01 PM, vadim wrote:
> > > > So I tried to begin to implement "split into small sub-tasks", but got
> > > > stuck before any meaningful step, at what may be another version 9
> > > > issue (?)
> > > >
> > > > Though I remember "DO NOT give an x argument to 6!:2", in the tests
> > > > below there is an x argument, to catch solid time. Otherwise, I could
> > > > cut 10^7 numbers into ~100 ranges and sort each (closer to my real
> > > > task); then there would be no x argument, with approx. the same
> > > > "solid" time difference caught (decimal point moved 2 places to the
> > > > right, of course), but cutting would mask the real issue.
> > > >
> > > >    100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 100000
> > > > 0.00263482
> > > >    100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 99999
> > > > 0.0073261
> > > >    100 (6!:2) '}./:~ nn,_1' [ nn =: ? N $ 10^9 [ N =: 99999
> > > > 0.00298486
> > > >
> > > > If I reboot into Linux, it's not a 3x but a 4x difference between
> > > > tests 1 and 2. Test 3 is like the ridiculous advice "don't sort 99999
> > > > (nor 99000, perhaps) numbers; instead pad with throw-away values to
> > > > 10^5 numbers." I hope my result is not CPU-model specific, and you can
> > > > observe something similar.
> > > >
> > > > No issue in 8.07.
> > > >
> > > > Best regards,
> > > > Vadim
> > > >
> > > > On Fri, Jan 27, 2023 at 6:18 PM Henry Rich <henryhr...@gmail.com> wrote:
> > > >
> > > > > J9.04 is frozen for the release, but J9.05 will follow. This thread
> > > > > has convinced me that sorting large arrays can be greatly improved
> > > > > by multithreading, and also that I can reduce the threading overhead.
> > > > >
> > > > > Henry Rich
> > > > >
> > > > > On 1/27/2023 6:07 AM, vadim wrote:
> > > > > > > What result do you get when you use this syntax?
> > > > > > "Don't use master thread, use worker threads only", correct? So I'm
> > > > > > allocating 3 workers for 4 equal pieces of work; 3 workers each
> > > > > > finish their task, then 2 of them are idle, and 1 alone does the
> > > > > > 4th part?
> > > > > >
> > > > > > Interestingly, with literals I get a predictable time increase from
> > > > > > ~400 ms to ~650 ms. Which concurs, more or less. With integers, I
> > > > > > get the same ~440 ms. How very strange. And it's the same ~420 ...
> > > > > > 440 ms if I cut ints into 2 or 3 parts (the master thread is idle
> > > > > > then).
> > > > > >
> > > > > > -------------
> > > > > > I don't know if "parallelizable primitives" (including "sort") will
> > > > > > arrive with the 9.04 release, and maybe what follows was already
> > > > > > mentioned in the community, but with sorting literals there's an
> > > > > > effective and very lazy way to gain speed through using threads as
> > > > > > they are implemented now:
> > > > > >
> > > > > >    (6!:2) '/:~ lits'
> > > > > > 1.37036
> > > > > >    (6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'
> > > > > > 0.665364
> > > > > >    1 T. ''
> > > > > > 3
> > > > > > -------------
> > > > > > Henry, I only have a shallow understanding; my practical point of
> > > > > > view was "try to divide work into preferably large and easy-to-manage
> > > > > > parts, then oversee that the whole gang of workers is busy and none
> > > > > > of them slacks". Thank you very much for the detailed explanation;
> > > > > > this cutting into very small pieces is new to me, I'll now see if I
> > > > > > can implement it in practice. Sorting is only part of a larger whole;
> > > > > > it's just that I noticed interesting results while experimenting.
> > > > > >
> > > > > > Best regards,
> > > > > > Vadim
> > > > > >
> > > > > > On Fri, Jan 27, 2023 at 12:11 AM Ak O <akin...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Vadim,
> > > > > > >
> > > > > > > What result do you get when you use this syntax?
> > > > > > >
> > > > > > > ; cut_by /:~t.(<'worker';1);.0 ints
> > > > > > >
> > > > > > > Ak.
> > > > > > >
> > > > > > > On Thu., Jan. 26, 2023, 08:46 Henry Rich, <henryhr...@gmail.com> wrote:
> > > > > > >
> > > > > > > > 1. Make sure you turn off all browsers etc. when you time
> > > > > > > > multithreaded code.
> > > > > > > >
> > > > > > > > 2. DO NOT give an x argument to 6!:2.
> > > > > > > > Repeated execution of the same sentence may tend to migrate
> > > > > > > > values into the core they will be used in, which will
> > > > > > > > underestimate the time required for a single run. If you have to
> > > > > > > > time anything complicated you may have to put in your own timing
> > > > > > > > points.
> > > > > > > >
> > > > > > > > 3. 'Thread overhead' is more than just data movement. It includes
> > > > > > > > time required to lock variable-name lookups and contention for
> > > > > > > > D3$. The sort of 8-byte literals uses a merge sort IIRC, while the
> > > > > > > > sort of integers uses a lovingly-coded quicksort. The quicksort is
> > > > > > > > faster, as you can see. Mergesort is very cache-friendly: it reads
> > > > > > > > and writes to memory sequentially. Quicksort hops around, storing
> > > > > > > > single words into two different streams. On a single thread that
> > > > > > > > doesn't matter much, but with multiple threads and arguments far
> > > > > > > > larger than D3$, the writing spills to DRAM: mergesort needs only
> > > > > > > > one open DRAM page per thread while quicksort needs two per
> > > > > > > > thread. That will have to make a difference; how much I don't know.
> > > > > > > >
> > > > > > > > 4. If you want to split sorting into tasks, split it so that each
> > > > > > > > task fits into D2$ of a core. With 10M numbers (80MB) and a
> > > > > > > > machine with 1MB D2$, let each task have 125K numbers. Experiment
> > > > > > > > with smaller blocks. When you have the right size, the thread
> > > > > > > > overhead will be comparable to the time spent making two copies of
> > > > > > > > the whole array.
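[Editorial sketch, not part of the thread: the task-splitting scheme discussed here — sort cache-sized blocks independently (the pieces that would go to worker threads), then k-way merge the ordered runs — in Python as a stand-in for the J `t.`/`;.0` machinery. The function name and block size are illustrative only.]

```python
import heapq

def chunked_sort(a, block=125_000):
    # Sort each cache-sized block independently; these are the tasks that
    # would be handed to worker threads. block=125_000 8-byte numbers is
    # about 1MB, matching the D2$ sizing suggested above.
    runs = [sorted(a[i:i + block]) for i in range(0, len(a), block)]
    # Merge the ordered runs (a reheaping operation) back together.
    return list(heapq.merge(*runs))
```

`heapq.merge` keeps one cursor per run in a small heap, so the merge reads each run sequentially, which is the cache-friendly access pattern the thread relies on.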
> > > > > > > >
> > > > > > > > 5. If you find that you can reduce the overhead that far, we
> > > > > > > > could have a faster way to sort in J using multithreads: split
> > > > > > > > the array into as many pieces as needed and sort them efficiently
> > > > > > > > in threads, then transfer the data back to the master thread,
> > > > > > > > which merges the ordered strings. The transfer back to the master
> > > > > > > > thread is unavoidable, but the time required to merge (a
> > > > > > > > reheaping operation) would be completely hidden under the
> > > > > > > > transfer time.
> > > > > > > >
> > > > > > > > 6. Currently, starting a thread realizes any virtual argument in
> > > > > > > > the master thread. It would be better to realize it in the worker
> > > > > > > > thread where the data is needed.
> > > > > > > >
> > > > > > > > Henry Rich
> > > > > > > >
> > > > > > > > On 1/25/2023 6:11 PM, vadim wrote:
> > > > > > > > > Thank you for fixing the issue and for the explanation. I would
> > > > > > > > > appreciate it if you could look into the performance issue I'm
> > > > > > > > > observing with my multi-threaded example (reduced to
> > > > > > > > > uselessness for the sake of presentation, though). I will of
> > > > > > > > > course accept "that's how things are with CPUs/memory" and
> > > > > > > > > adjust expectations accordingly, and I understand results
> > > > > > > > > depend very much on hardware and dataset. Too small a chance
> > > > > > > > > it's a J issue.
> > > > > > > > >
> > > > > > > > > The question concerns sorting data (I'm not doing the '+/' in
> > > > > > > > > threads :)). I have a 4-core CPU, so I start 3 additional
> > > > > > > > > threads:
> > > > > > > > >
> > > > > > > > > {{ for. i. 3 do. 0 T. 0 end.
> > > > > > > > > }} ''
> > > > > > > > >
> > > > > > > > > Suppose I have literal data, 10 million rows of 8 bytes each:
> > > > > > > > >
> > > > > > > > > lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> > > > > > > > > q =: >. -: -: # lits
> > > > > > > > > cut_by =: q ,:~ "0 q * i. 4
> > > > > > > > > quarter =: q {. lits
> > > > > > > > >
> > > > > > > > > 4 (6!:2) '/:~ lits'
> > > > > > > > > 1.37698
> > > > > > > > > 4 (6!:2) '/:~ quarter'
> > > > > > > > > 0.308181
> > > > > > > > > 4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> > > > > > > > > 0.414603
> > > > > > > > >
> > > > > > > > > Excellent, the times I see match nicely; I understand there's
> > > > > > > > > overhead with threads. Next, there are 10 million 8-byte
> > > > > > > > > integers:
> > > > > > > > >
> > > > > > > > > ints =: ? (1 * 10^7) $ (10^18)
> > > > > > > > > q =: >. -: -: # ints
> > > > > > > > > cut_by =: q ,:~ "0 q * i. 4
> > > > > > > > > quarter =: q {. ints
> > > > > > > > >
> > > > > > > > > 4 (6!:2) '/:~ ints'
> > > > > > > > > 0.561057
> > > > > > > > > 4 (6!:2) '/:~ quarter'
> > > > > > > > > 0.124735
> > > > > > > > > 4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> > > > > > > > > 0.441807
> > > > > > > > >
> > > > > > > > > And I don't like this third time. There's roughly the same
> > > > > > > > > amount of data for "threads overhead". My expectation for this
> > > > > > > > > result was approx. 150-200 ms or so.
> > > > > > > > >
> > > > > > > > > In fact, looking at 415 ms for literals and 442 ms for numbers
> > > > > > > > > irked me into temptation:
> > > > > > > > >
> > > > > > > > > head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> > > > > > > > > _8 ]\ a. i. head
> > > > > > > > > 226 0 0 0 0 0 0 0
> > > > > > > > > 0 0 0 0 0 0 0 4
> > > > > > > > > 0 0 0 0 0 38 37 160
> > > > > > > > > 0 0 0 0 0 0 0 1
> > > > > > > > > 0 0 0 0 0 38 37 160
> > > > > > > > >
> > > > > > > > > to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> > > > > > > > > from_bytes =: (3!:2) @: (head&,) @: ,
> > > > > > > > >
> > > > > > > > > (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> > > > > > > > > 1
> > > > > > > > >
> > > > > > > > > 4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> > > > > > > > > 0.51177
> > > > > > > > >
> > > > > > > > > Ah, it didn't work. But perhaps it could with certain data, CPU
> > > > > > > > > model, or number of cores? So my question is whether you could
> > > > > > > > > confirm that it (slower-than-expected speed with numerical sort
> > > > > > > > > in threads) is neither a J issue, nor 't.', nor '/:~'. Sorry if
> > > > > > > > > I wasted your time.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Vadim
> > > > > > > > >
> > > > > > > > > On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <henryhr...@gmail.com> wrote:
> > > > > > > > > > Fixed for the release. Thanks for the clear report. The
> > > > > > > > > > problem was specific to the forms you mentioned. Workaround:
> > > > > > > > > > use
> > > > > > > > > >
> > > > > > > > > > < @: ((+/) @:]) instead of < @: (+/) @:]
> > > > > > > > > >
> > > > > > > > > > The form <@:f is given special treatment. Your form was
> > > > > > > > > > incorrectly being given that treatment.
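[Editorial sketch, not part of the thread: the to_bytes/from_bytes experiment above reinterprets J's in-memory array representation directly. The general trick behind sorting numbers as byte strings — an order-preserving byte key, so that lexicographic byte comparison agrees with signed numeric order — can be illustrated in Python. The names here are invented for the example; this is not the J mechanism.]

```python
import struct

SIGN = 1 << 63  # sign bit of a 64-bit integer

def to_key(x):
    # Offset-binary encoding: adding 2^63 (i.e. flipping the sign bit)
    # and packing big-endian makes lexicographic byte order match
    # signed 64-bit numeric order.
    return struct.pack('>Q', (x + SIGN) & (2**64 - 1))

def from_key(b):
    return struct.unpack('>Q', b)[0] - SIGN

vals = [5, -3, 0, 2**40, -(2**40)]
assert [from_key(k) for k in sorted(to_key(v) for v in vals)] == sorted(vals)
```

Whether this pays off in practice depends, as in the experiment above, on whether the byte-string sort plus the two conversions beats the native integer sort.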
> > > > > > > > > >
> > > > > > > > > > If t. in cut is not meeting your expectations, perhaps you
> > > > > > > > > > should adjust your expectations. Verbs like (+/) will not
> > > > > > > > > > benefit from threading in most cases, and may slow down
> > > > > > > > > > considerably. +/;.0 might be even worse. Why? Because +/ is
> > > > > > > > > > totally limited by the speed of reading from memory. If the
> > > > > > > > > > data fits in level-2 data cache (D2$) many cores are no
> > > > > > > > > > faster than one.
> > > > > > > > > >
> > > > > > > > > > In fact they are much slower, because only one core has the
> > > > > > > > > > data in D2$. The rest have to transfer the data from the
> > > > > > > > > > bottom of the ocean (i. e. from the core with the data
> > > > > > > > > > through D3$) or from the moon (SDRAM). They are spending
> > > > > > > > > > their time waiting for responses from memory. +/;.0 creates
> > > > > > > > > > a virtual block for each section and passes that to +/ .
> > > > > > > > > > There is no need to move the data except for the reading
> > > > > > > > > > required by +/ . If you run the +/ in a thread, the virtual
> > > > > > > > > > block must be realized with an explicit copy from the bottom
> > > > > > > > > > of the ocean. That doesn't add much, because once the copy is
> > > > > > > > > > made the data will be in D2$ of the receiving core, but it is
> > > > > > > > > > a small slowdown.
> > > > > > > > > >
> > > > > > > > > > A thread needs to be able to run in its own core until it has
> > > > > > > > > > done reads+writes to D1$/D2$ at least, say, 100 times the size
> > > > > > > > > > of its arguments+result. +/ . * is a perfect example. On large
> > > > > > > > > > matrices the arguments are cycled through repeatedly.
> > > > > > > > > >
> > > > > > > > > > Henry Rich
> > > > > > > > > >
> > > > > > > > > > On 1/25/2023 7:08 AM, vadim wrote:
> > > > > > > > > > > Hi, please consider this:
> > > > > > > > > > >
> > > > > > > > > > > ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> > > > > > > > > > > +---+---+
> > > > > > > > > > > |0 1|2 3|
> > > > > > > > > > > +---+---+
> > > > > > > > > > > ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> > > > > > > > > > > +---+---+
> > > > > > > > > > > |0 1|2 3|
> > > > > > > > > > > +---+---+
> > > > > > > > > > > ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > > +---+---+
> > > > > > > > > > > |0 1|2 3|
> > > > > > > > > > > +---+---+
> > > > > > > > > > >
> > > > > > > > > > > No issues in 8.07; and a bug (that's what I'd call it) in
> > > > > > > > > > > 9.03 and 9.04. It looks like it happens if the left arg has
> > > > > > > > > > > multiple ranges, and the verb to apply is composed with
> > > > > > > > > > > "same" and "box" verbs as first and last in the sequence.
> > > > > > > > > > > But that's at first glance only. Sometimes omitting
> > > > > > > > > > > parentheses would help (which clearly means a parsing
> > > > > > > > > > > issue?). All these produce expected output:
> > > > > > > > > > >
> > > > > > > > > > > (2,:2) (< @: +: @: ]);.0 i. 4
> > > > > > > > > > > +---+
> > > > > > > > > > > |4 6|
> > > > > > > > > > > +---+
> > > > > > > > > > > ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> > > > > > > > > > > 0 2
> > > > > > > > > > > 4 6
> > > > > > > > > > > ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> > > > > > > > > > > +-+-+
> > > > > > > > > > > |1|5|
> > > > > > > > > > > +-+-+
> > > > > > > > > > > (0,:2) (< @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > > +---+
> > > > > > > > > > > |1 0|
> > > > > > > > > > > +---+
> > > > > > > > > > > ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > > 1 0
> > > > > > > > > > > 3 2
> > > > > > > > > > > ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> > > > > > > > > > > +---+---+
> > > > > > > > > > > |1 0|3 2|
> > > > > > > > > > > +---+---+
> > > > > > > > > > >
> > > > > > > > > > > While "why would you want to use the ']' here?" would be
> > > > > > > > > > > reasonable to ask, in the end syntax is either correct or
> > > > > > > > > > > not. In fact, I was testing all kinds of different
> > > > > > > > > > > constructs, investigating why a multi-threaded operation
> > > > > > > > > > > (with "t.") on subarrays is so much slower than expected,
> > > > > > > > > > > although it's perhaps totally unrelated to what's
> > > > > > > > > > > discovered above.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Vadim

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm