This is so hard to measure! What's a reasonable reference system?
The last time I measured, radixsort and mergesort had sizes where they were
better. Now, on my Alder Lake CPU, quicksort beats the lot, at all sizes,
integer and float. I don't see why it changed so much.
There is still a sizable niche for small-range integer sort at all lengths,
but the range decreases with length.
Henry Rich
On 2/3/2023 9:59 PM, Marshall Lochbaum wrote:
> In case it helps, I took the following measurement with bencharray.
> Looks like the parts other than those plateaus in the middle have gotten
> faster since 9.03, although they were still steps up before. The input
> ranges from -2^31 to 1-~2^31.
>
> https://raw.githubusercontent.com/gist/mlochbaum/40952f8d28ef745dc2d83d8278908ca8/raw/sort-rand-i32-x-j.svg
>
> In the key, ∧ is /:~ and ⍋ is /: . Same for ∨ ⍒ in the other direction.
>
> Reproduce with https://github.com/mlochbaum/bencharray and the following
> command; requires Unix and a BQN install and takes a minute or two to
> run. Output in output/plot/sort-rand-i32-x-j.svg .
>
> $ ./benchmark.bqn all j sort-rand-i32
>
> Marshall
>
> On Fri, Feb 03, 2023 at 06:08:32PM -0500, Henry Rich wrote:
> > Integer sort chooses between 4 algorithms based on timings we make every now
> > and then. It's been a few years since we tuned it.
> >
> > 100000 integers or more uses mergesort; 50000-99999 uses radixsort.
> >
> > It looks like I'd better retune that. Thanks for the tip.
> >
> > Henry Rich
> >
> > On 2/3/2023 5:01 PM, vadim wrote:
> > > So I tried to begin to implement "split into small sub-tasks", but got
> > > stuck before any meaningful step, at what may be another version 9 issue
(?)
> > >
> > > Though I remember " DO NOT give an x argument to 6!:2", in tests below
> > > there is x argument to catch solid time. Otherwise, I could cut 10^7
> > > numbers into ~100 ranges to sort each (closer to my real task), then there
> > > will be no x argument, with approx. the same "solid" time difference
caught
> > > (decimal point moved to the right 2 places of course), but cutting would
> > > mask the real issue.
> > >
> > > 100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 100000
> > >
> > > 0.00263482
> > > 100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 99999
> > >
> > > 0.0073261
> > > 100 (6!:2) '}./:~ nn,_1' [ nn =: ? N $ 10^9 [ N =: 99999
> > >
> > > 0.00298486
> > >
> > > If I reboot into Linux, it's not 3x, but 4x difference between tests 1 and
> > > 2. Test 3 is like ridiculous advice "don't sort 99999 (nor 99000, perhaps)
> > > numbers; instead pad with throw-away values to 10^5 numbers. I hope my
> > > result is not CPU-model specific, and you can observe something similar.
> > >
> > > No issue in 8.07.
> > >
> > > Best regards,
> > > Vadim
> > >
> > >
> > > On Fri, Jan 27, 2023 at 6:18 PM Henry Rich <henryhr...@gmail.com> wrote:
> > >
> > > > J9.04 is frozen for the release, but J9.05 will follow. This thread has
> > > > convinced me that sorting large arrays can be greatly improved by
> > > > multithreading, and also that I can reduce the threading overhead.
> > > >
> > > > Henry Rich
> > > >
> > > > On 1/27/2023 6:07 AM, vadim wrote:
> > > > > > What result do you get when you use this syntax?
> > > > > "Don't use master thread, use worker threads only", correct? So, I'm
> > > > > allocating 3 workers for 4 equal pieces of work; 3 workers finish
doing
> > > > > their task each, then 2 of them are idle, and 1 alone does the 4th
part?
> > > > >
> > > > > Interestingly, with literals I get predictable time increase from ~400 ms
> > > > > to ~650 ms. Which concurs, more or less. With integers, I get the same
> > > > ~440
> > > > > ms. How very strange. And it's the same ~420 ... 440 ms if I cut ints
> > > > into
> > > > > 2 or 3 parts (master thread is idle then).
> > > > >
> > > > > -------------
> > > > > I don't know if "parallelizable primitives" (including "sort") will
> > > > arrive
> > > > > with 9.04 release, and maybe what follows was already mentioned in
> > > > > community, but with sorting literals there's effective and very lazy
way
> > > > to
> > > > > gain speed through using threads as they are implemented now:
> > > > >
> > > > > (6!:2) '/:~ lits'
> > > > >
> > > > > 1.37036
> > > > >
> > > > > (6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'
> > > > >
> > > > > 0.665364
> > > > >
> > > > > 1 T. ''
> > > > >
> > > > > 3
> > > > > -------------
> > > > > Henry, I only have shallow understanding, my practical point of view
was
> > > > > "try to divide work into preferably large and easy to manage parts,
then
> > > > > oversee the whole gang of workers are busy and none of them slacks".
> > > > Thank
> > > > > you very much for the detailed explanation, this cutting into very
small
> > > > > pieces is new to me, I'll now see if I can implement it in practice.
> > > > > Sorting is only part of a larger whole; it's just that I noticed
> > > > > interesting results while experimenting.
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Vadim
> > > > >
> > > > >
> > > > > On Fri, Jan 27, 2023 at 12:11 AM Ak O <akin...@gmail.com> wrote:
> > > > >
> > > > > > Hi Vadim,
> > > > > >
> > > > > > What result do you get when you use this syntax?
> > > > > >
> > > > > > ; cut_by /:~t.(<'worker';1);.0 ints
> > > > > >
> > > > > > Ak.
> > > > > >
> > > > > >
> > > > > > On Thu., Jan. 26, 2023, 08:46 Henry Rich, <henryhr...@gmail.com> wrote:
> > > > > >
> > > > > > > 1. Make sure you turn off all browsers etc. when you time
> > > > multrithreaded
> > > > > > > code.
> > > > > > >
> > > > > > > 2. DO NOT give an x argument to 6!:2. Repeated execution of the same
> > > > > > > sentence may tend to migrate values into the core they will be
used in,
> > > > > > > which will underestimate the time required for a single run. If
you
> > > > > > > have to time anything complicated you may have to put in your own
> > > > timing
> > > > > > > points.
> > > > > > >
> > > > > > > 3. 'Thread overhead' is more than just data movement. It includes time
> > > > > > > required to lock variable-name lookups and contention for D3$.
The
> > > > sort
> > > > > > > of 8-byte literals uses a merge sort IIRC, while the sort of
integers
> > > > > > > uses a lovingly-coded quicksort. The quicksort is faster, as you
can
> > > > > > > see. Mergesort is very cache-friendly: it reads and writes to
memory
> > > > > > > sequentially. Quicksort hops around, storing single words into two
> > > > > > > different streams. On a single thread that doesn't matter much,
but
> > > > > > > with multiple threads and arguments far larger than D3$, the
writing
> > > > > > > spills to DRAM: mergesort needs only one open DRAM page per thread
> > > > while
> > > > > > > quicksort needs two per thread. That will have to make a
difference;
> > > > > > > how much I don't know.
> > > > > > >
> > > > > > > 4. If you want to split sorting into tasks, split it so that each task
> > > > > > > fits into D2$ of a core. With 10M numbers (80MB) and a machine
with
> > > > 1MB
> > > > > > > D2$, let each task have 125K numbers. Experiment with smaller
blocks.
> > > > > > > When you have the right size, the thread overhead will be
comparable to
> > > > > > > the time spent making two copies of the whole array.
> > > > > > >
> > > > > > > 5. If you find that you can reduce the overhead that far, we could have
> > > > > > > a faster way to sort in J using multithreads: split the array
into as
> > > > > > > many pieces as needed and sort them efficiently in threads, then
> > > > > > > transfer the data back to the master thread which merges the
ordered
> > > > > > > strings. The transfer back to the master thread is unavoidable,
but
> > > > the
> > > > > > > time required to merge (a reheaping operation) would be completely
> > > > > > > hidden under the transfer time.
> > > > > > >
> > > > > > > 6. Currently, starting a thread realizes any virtual argument in the
> > > > > > > master thread. It would be better to realize it in the worker
thread
> > > > > > > where the data is needed.
> > > > > > >
> > > > > > > Henry Rich
> > > > > > >
> > > > > > > On 1/25/2023 6:11 PM, vadim wrote:
> > > > > > > > Thank you for fixing the issue and for explanation. I would
appreciate
> > > > > > it
> > > > > > > > if you could look into the performance issue I'm observing with
my
> > > > > > > > multi-threaded example (reduced to uselessness for sake of
> > > > > > presentation,
> > > > > > > > though). I will of course accept "that's how things are with
> > > > > > CPUs/memory"
> > > > > > > > and adjust expectations accordingly, and I understand results
depend
> > > > on
> > > > > > > > hardware and dataset very much. Too small a chance it's J issue.
> > > > > > > >
> > > > > > > > The question concerns sorting data (I'm not doing the '+/' in threads
> > > > > > > :)).
> > > > > > > > I have 4 cores CPU, so I start 3 additional threads:
> > > > > > > >
> > > > > > > > {{ for. i. 3 do. 0 T. 0 end. }} ''
> > > > > > > >
> > > > > > > > Suppose I have literal data, 10 million rows 8 bytes each:
> > > > > > > >
> > > > > > > > lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> > > > > > > > q =: >. -: -: # lits
> > > > > > > > cut_by =: q ,:~ "0 q * i. 4
> > > > > > > > quarter =: q {. lits
> > > > > > > >
> > > > > > > > 4 (6!:2) '/:~ lits'
> > > > > > > > 1.37698
> > > > > > > > 4 (6!:2) '/:~ quarter'
> > > > > > > > 0.308181
> > > > > > > > 4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> > > > > > > > 0.414603
> > > > > > > >
> > > > > > > > Excellent, times I see match nicely, I understand there's overhead
> > > > with
> > > > > > > > threads. Next, there are 10 millions of 8-byte integers:
> > > > > > > >
> > > > > > > > ints =: ? (1 * 10^7) $ (10^18)
> > > > > > > > q =: >. -: -: # ints
> > > > > > > > cut_by =: q ,:~ "0 q * i. 4
> > > > > > > > quarter =: q {. ints
> > > > > > > >
> > > > > > > > 4 (6!:2) '/:~ ints'
> > > > > > > > 0.561057
> > > > > > > > 4 (6!:2) '/:~ quarter'
> > > > > > > > 0.124735
> > > > > > > > 4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> > > > > > > > 0.441807
> > > > > > > >
> > > > > > > > And I don't like this third time. There's roughly the same amount of
> > > > > > data
> > > > > > > > for "threads overhead". My expectation for this result was
approx.
> > > > > > > 150-200
> > > > > > > > ms or so.
> > > > > > > >
> > > > > > > > In fact, looking at 415 ms for literals and 442 ms for numbers irked
> > > > me
> > > > > > > > into temptation:
> > > > > > > >
> > > > > > > > head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> > > > > > > >
> > > > > > > > _8 ]\ a. i. head
> > > > > > > >
> > > > > > > > 226 0 0 0 0 0 0 0
> > > > > > > >
> > > > > > > > 0 0 0 0 0 0 0 4
> > > > > > > >
> > > > > > > > 0 0 0 0 0 38 37 160
> > > > > > > >
> > > > > > > > 0 0 0 0 0 0 0 1
> > > > > > > >
> > > > > > > > 0 0 0 0 0 38 37 160
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> > > > > > > >
> > > > > > > > from_bytes =: (3!:2) @: (head&,) @: ,
> > > > > > > >
> > > > > > > >
> > > > > > > > (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> > > > > > > >
> > > > > > > > 1
> > > > > > > >
> > > > > > > >
> > > > > > > > 4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> > > > > > > >
> > > > > > > > 0.51177
> > > > > > > >
> > > > > > > >
> > > > > > > > Ah, it didn't work. But perhaps it could with certain data, CPU model,
> > > > > > > > number of cores? So my question is if you could confirm that it
> > > > (slower
> > > > > > > > than I expected speed with numerical sort in threads) is
neither J
> > > > > > issue,
> > > > > > > > nor 't.', nor '/:~'. Sorry if I wasted your time.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Vadim
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <henryhr...@gmail.com>
> > > > > > wrote:
> > > > > > > > > Fixed for the release. Thanks for the clear report. The
problem was
> > > > > > > > > specific to the forms you mentioned. Workaround: use
> > > > > > > > >
> > > > > > > > > < @: ((+/) @:]) instead of < @: (+/) @:]
> > > > > > > > >
> > > > > > > > > The form <@:f is given special treatment. Your form was incorrectly
> > > > > > > > > being given that treatment.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If t. in cut is not meeting your expectations, perhaps you should
> > > > > > adjust
> > > > > > > > > your expectations. Verbs like (+/) will not benefit from
threading
> > > > in
> > > > > > > > > most cases, and may slow down considerably. +/;.0 might be
even
> > > > > > worse.
> > > > > > > > > Why? Because +/ is totally limited by the speed of reading
from
> > > > > > > > > memory. If the data fits in level-2 data cache (D2$) many
cores are
> > > > > > no
> > > > > > > > > faster than one.
> > > > > > > > >
> > > > > > > > > In fact they are much slower, because only one core has the data in
> > > > > > > > > D2$. The rest have to transfer the data from the bottom of
the ocean
> > > > > > > > > (i. e. from the core with the data through D3$) or from the
moon
> > > > > > > > > (SDRAM). They are spending their time waiting for responses
from
> > > > > > > memory.
> > > > > > > > > +/;.0 creates a virtual block for each section and passes
that to +/
> > > > .
> > > > > > > > > There is no need to move the data except for the reading
required by
> > > > > > +/
> > > > > > > > > . If you run the +/ in a thread, the virtual block must be
realized
> > > > > > > > > with an explicit copy from the bottom of the ocean. That
doesn't add
> > > > > > > > > much, because once the copy is made the data will be in D2$
of the
> > > > > > > > > receiving core, but it is a small slowdown.
> > > > > > > > >
> > > > > > > > > A thread needs to be able to run in its own core until it has done
> > > > > > > > > reads+writes to D1$/D2$ at least, say, 100 times the size of
its
> > > > > > > > > arguments+result. +/ . * is a perfect example. On large
matrices
> > > > the
> > > > > > > > > arguments are cycled through repeatedly.
> > > > > > > > >
> > > > > > > > > Henry Rich
> > > > > > > > >
> > > > > > > > > On 1/25/2023 7:08 AM, vadim wrote:
> > > > > > > > > > Hi, please consider this:
> > > > > > > > > >
> > > > > > > > > > ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> > > > > > > > > > +---+---+
> > > > > > > > > > |0 1|2 3|
> > > > > > > > > > +---+---+
> > > > > > > > > > ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> > > > > > > > > > +---+---+
> > > > > > > > > > |0 1|2 3|
> > > > > > > > > > +---+---+
> > > > > > > > > > ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > +---+---+
> > > > > > > > > > |0 1|2 3|
> > > > > > > > > > +---+---+
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and
> > > > > > > 9.04.
> > > > > > > > > > Looks like it happens if the left arg has multiple ranges;
and a
> > > > verb
> > > > > > > to
> > > > > > > > > > apply is composed with "same" and "box" verbs as first and
last in
> > > > > > > > > > sequence. But it's at 1st glance only. Sometimes omitting
> > > > parentheses
> > > > > > > > would
> > > > > > > > > > help (which clearly means parsing issue?). All these produce
> > > > expected
> > > > > > > > > > output:
> > > > > > > > > >
> > > > > > > > > > (2,:2) (< @: +: @: ]);.0 i. 4
> > > > > > > > > > +---+
> > > > > > > > > > |4 6|
> > > > > > > > > > +---+
> > > > > > > > > > ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> > > > > > > > > > 0 2
> > > > > > > > > > 4 6
> > > > > > > > > > ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> > > > > > > > > > +-+-+
> > > > > > > > > > |1|5|
> > > > > > > > > > +-+-+
> > > > > > > > > > (0,:2) (< @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > +---+
> > > > > > > > > > |1 0|
> > > > > > > > > > +---+
> > > > > > > > > > ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> > > > > > > > > > 1 0
> > > > > > > > > > 3 2
> > > > > > > > > > ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> > > > > > > > > > +---+---+
> > > > > > > > > > |1 0|3 2|
> > > > > > > > > > +---+---+
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > While "why would you want to use the ']' here?" would be reasonable
> > > > > > to
> > > > > > > > ask,
> > > > > > > > > > but, in the end, syntax is either correct or not. In fact,
I was
> > > > > > > testing
> > > > > > > > > > all kinds of different constructs investigating why
multi-threaded
> > > > > > > > > > operation (with "t.") on subarrays is so much slower than
expected,
> > > > > > > > > > although it's perhaps totally unrelated to what's
discovered above.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Vadim
> > > > > > > > > >
> > > > > > ----------------------------------------------------------------------
> > > > > > > > > > For information about J forums see
> > > > > > http://www.jsoftware.com/forums.htm
> > > > ----------------------------------------------------------------------
> > > > > > > > > For information about J forums see
> > > > > > http://www.jsoftware.com/forums.htm
> > > > > > > >
----------------------------------------------------------------------
> > > > > > > > For information about J forums see
> > > > http://www.jsoftware.com/forums.htm
> > > > > > >
----------------------------------------------------------------------
> > > > > > > For information about J forums see
http://www.jsoftware.com/forums.htm
> > > > > > >
> > > > > > ----------------------------------------------------------------------
> > > > > > For information about J forums see
http://www.jsoftware.com/forums.htm
> > > > > >
> > > > > ----------------------------------------------------------------------
> > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm