Re: [Jprogramming] Version 9 sorting anomaly? (was: Re: Unexpected results (a bug?) in version 9 with some cases of cut (subarray))

Henry Rich Sat, 04 Feb 2023 07:49:41 -0800

This is so hard to measure!  What's a reasonable reference system?

The last time I measured, radixsort and mergesort had sizes where theywere better. Now, on my Alder Lake CPU, quicksort beats the lot, at allsizes, integer and float. I don't see why it changed so much.

There is still a sizable niche for small-range integer sort at alllengths, but the range decreases with length.


Henry Rich

On 2/3/2023 9:59 PM, Marshall Lochbaum wrote:

In case it helps, I took the following measurement with bencharray.
Looks like the parts other than those plateaus in the middle have gotten
faster since 9.03, although they were still steps up before. The input
ranges from -2^31 to 1-~2^31.

https://raw.githubusercontent.com/gist/mlochbaum/40952f8d28ef745dc2d83d8278908ca8/raw/sort-rand-i32-x-j.svg

In the key, ∧ is /:~ and ⍋ is /: . Same for ∨ ⍒ in the other direction.

Reproduce with https://github.com/mlochbaum/bencharray and the following
command; requires Unix and a BQN install and takes a minute or two to
run. Output in output/plot/sort-rand-i32-x-j.svg .

$ ./benchmark.bqn all j sort-rand-i32

Marshall

On Fri, Feb 03, 2023 at 06:08:32PM -0500, Henry Rich wrote:

Integer sort chooses between 4 algorithms based on timings we make every now
and then.  It's been a few years since we tuned it.

100000 integers or more uses mergesort; 50000-99999 uses radixsort.

It looks like I'd better retune that.  Thanks for the tip.

Henry Rich

On 2/3/2023 5:01 PM, vadim wrote:

So I tried to begin to implement "split into small sub-tasks", but got
stuck before any meaningful step, at what may be another version 9 issue (?)

Though I remember " DO NOT give an x argument to 6!:2", in tests below
there is x argument to catch solid time. Otherwise, I could cut 10^7
numbers into ~100 ranges to sort each (closer to my real task), then there
will be no x argument, with approx. the same "solid" time difference caught
(decimal point moved to the right 2 places of course), but cutting would
mask the real issue.

100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 100000

0.00263482
100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 99999

0.0073261
100 (6!:2) '}./:~ nn,_1' [ nn =: ? N $ 10^9 [ N =: 99999

0.00298486

If I reboot into Linux, it's not 3x, but 4x difference between tests 1 and
2. Test 3 is like ridiculous advice "don't sort 99999 (nor 99000, perhaps)
numbers; instead pad with throw-away values to 10^5 numbers. I hope my
result is not CPU-model specific, and you can observe something similar.

No issue in 8.07.

Best regards,
Vadim


On Fri, Jan 27, 2023 at 6:18 PM Henry Rich <henryhr...@gmail.com> wrote:

J9.04 is frozen for the release, but J9.05 will follow.  This thread has
convinced me that sorting large arrays can be greatly improved by
multithreading, and also that I can reduce the threading overhead.

Henry Rich

On 1/27/2023 6:07 AM, vadim wrote:

What result do you get when you use this syntax?

"Don't use master thread, use worker threads only", correct? So, I'm
allocating 3 workers for 4 equal pieces of work; 3 workers finish doing
their task each, then 2 of them are idle, and 1 alone does the 4th part?

Interestingly, with literals I get predictable time increase from ~400 ms
to ~650 ms. Which concurs, more or less. With integers, I get the same

~440

ms. How very strange. And it's the same ~420 ... 440 ms if I cut ints

into

2 or 3 parts (master thread is idle then).

-------------
I don't know if "parallelizable primitives" (including "sort") will

arrive

with 9.04 release, and maybe what follows was already mentioned in
community, but with sorting literals there's effective and very lazy way

to

gain speed through using threads as they are implemented now:

(6!:2) '/:~ lits'

1.37036

(6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'

0.665364

1 T. ''

3
-------------
Henry, I only have shallow understanding, my practical point of view was
"try to divide work into preferably large and easy to manage parts, then
oversee the whole gang of workers are busy and none of them slacks".

Thank

you very much for the detailed explanation, this cutting into very small
pieces is new to me, I'll now see if I can implement it in practice.
Sorting is only part of a larger whole; it's just that I noticed
interesting results while experimenting.


Best regards,
Vadim


On Fri, Jan 27, 2023 at 12:11 AM Ak O <akin...@gmail.com> wrote:

Hi Vadim,

What result do you get when you use this syntax?

        ; cut_by /:~t.(<'worker';1);.0 ints

Ak.


On Thu., Jan. 26, 2023, 08:46 Henry Rich, <henryhr...@gmail.com> wrote:

1. Make sure you turn off all browsers etc. when you time

multrithreaded

code.

2. DO NOT give an x argument to 6!:2.  Repeated execution of the same
sentence may tend to migrate values into the core they will be used in,
which will underestimate the time required for a single run.  If you
have to time anything complicated you may have to put in your own

timing

points.

3. 'Thread overhead' is more than just data movement.  It includes time
required to lock variable-name lookups and contention for D3$.  The

sort

of 8-byte literals uses a merge sort IIRC, while the sort of integers
uses a lovingly-coded quicksort.  The quicksort is faster, as you can
see.  Mergesort is very cache-friendly: it reads and writes to memory
sequentially. Quicksort hops around, storing single words into two
different streams.  On a single thread that doesn't matter much, but
with multiple threads and arguments far larger than D3$, the writing
spills to DRAM: mergesort needs only one open DRAM page per thread

while

quicksort needs two per thread.  That will have to make a difference;
how much I don't know.

4. If you want to split sorting into tasks, split it so that each task
fits into D2$ of a core.  With 10M numbers (80MB) and a machine with

1MB

D2$, let each task have 125K numbers.  Experiment with smaller blocks.
When you have the right size, the thread overhead will be comparable to
the time spent making two copies of the whole array.

5. If you find that you can reduce the overhead that far, we could have
a faster way to sort in J using multithreads: split the array into as
many pieces as needed and sort them efficiently in threads, then
transfer the data back to the master thread which merges the ordered
strings.  The transfer back to the master thread is unavoidable, but

the

time required to merge (a reheaping operation) would be completely
hidden under the transfer time.

6. Currently, starting a thread realizes any virtual argument in the
master thread.  It would be better to realize it in the worker thread
where the data is needed.

Henry Rich

On 1/25/2023 6:11 PM, vadim wrote:

Thank you for fixing the issue and for explanation. I would appreciate

it

if you could look into the performance issue I'm observing with my
multi-threaded example (reduced to uselessness for sake of

presentation,

though). I will of course accept "that's how things are with

CPUs/memory"

and adjust expectations accordingly, and I understand results depend

on

hardware and dataset very much. Too small a chance it's J issue.

The question concerns sorting data (I'm not doing the '+/' in threads

:)).

I have 4 cores CPU, so I start 3 additional threads:

{{ for. i. 3 do. 0 T. 0 end. }} ''

Suppose I have literal data, 10 million rows 8 bytes each:

lits =: a. {~ ? ((1 * 10^7), 8) $ 256
q =: >. -: -: # lits
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. lits

       4 (6!:2) '/:~ lits'
1.37698
       4 (6!:2) '/:~ quarter'
0.308181
       4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
0.414603

Excellent, times I see match nicely, I understand there's overhead

with

threads. Next, there are 10 millions of 8-byte integers:

ints =: ? (1 * 10^7) $ (10^18)
q =: >. -: -: # ints
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. ints

       4 (6!:2) '/:~ ints'
0.561057
       4 (6!:2) '/:~ quarter'
0.124735
       4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
0.441807

And I don't like this third time. There's roughly the same amount of

data

for "threads overhead". My expectation for this result was approx.

150-200

ms or so.

In fact, looking at 415 ms for literals and 442 ms for numbers irked

me

into temptation:

head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q

_8 ]\ a. i. head

226 0 0 0 0 0 0 0

0 0 0 0 0 0 0 4

0 0 0 0 0 38 37 160

0 0 0 0 0 0 0 1

0 0 0 0 0 38 37 160



to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))

from_bytes =: (3!:2) @: (head&,) @: ,


(from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?

1


4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'

0.51177


Ah, it didn't work. But perhaps it could with certain data, CPU model,
number of cores? So my question is if you could confirm that it

(slower

than I expected speed with numerical sort in threads) is neither J

issue,

nor 't.', nor '/:~'. Sorry if I wasted your time.

Best regards,
Vadim


On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <henryhr...@gmail.com>

wrote:

Fixed for the release.  Thanks for the clear report.  The problem was
specific to the forms you mentioned. Workaround: use

< @: ((+/) @:])   instead of   < @: (+/) @:]

The form <@:f is given special treatment.  Your form was incorrectly
being given that treatment.


If t. in cut is not meeting your expectations, perhaps you should

adjust

your expectations.  Verbs like (+/) will not benefit from threading

in

most cases, and may slow down considerably.  +/;.0 might be even

worse.

Why?  Because +/ is totally limited by the speed of reading from
memory.  If the data fits in level-2 data cache (D2$) many cores are

no

faster than one.

In fact they are much slower, because only one core has the data in
D2$.  The rest have to transfer the data from the bottom of the ocean
(i. e. from the core with the data through D3$) or from the moon
(SDRAM).  They are spending their time waiting for responses from

memory.

+/;.0 creates a virtual block for each section and passes that to +/

There is no need to move the data except for the reading required by

+/

.  If you run the +/ in a thread, the virtual block must be realized
with an explicit copy from the bottom of the ocean.  That doesn't add
much, because once the copy is made the data will be in D2$ of the
receiving core, but it is a small slowdown.

A thread needs to be able to run in its own core until it has done
reads+writes to D1$/D2$ at least, say, 100 times the size of its
arguments+result.  +/ . * is a perfect example.  On large matrices

the

arguments are cycled through repeatedly.

Henry Rich

On 1/25/2023 7:08 AM, vadim wrote:

Hi, please consider this:

        ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
+---+---+
|0 1|2 3|
+---+---+
        ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
+---+---+
|0 1|2 3|
+---+---+
        ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
+---+---+
|0 1|2 3|
+---+---+


No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and

9.04.

Looks like it happens if the left arg has multiple ranges; and a

verb

to

apply is composed with "same" and "box" verbs as first and last in
sequence. But it's at 1st glance only. Sometimes omitting

parentheses

would

help (which clearly means parsing issue?). All these produce

expected

output:

        (2,:2) (< @: +: @: ]);.0 i. 4
+---+
|4 6|
+---+
        ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
0 2
4 6
        ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
+-+-+
|1|5|
+-+-+
        (0,:2) (< @: (\:~) @: ]);.0 i. 4
+---+
|1 0|
+---+
        ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
1 0
3 2
        ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
+---+---+
|1 0|3 2|
+---+---+


While "why would you want to use the ']' here?" would be reasonable

to

ask,

but, in the end, syntax is either correct or not. In fact, I was

testing

all kinds of different constructs investigating why multi-threaded
operation (with "t.") on subarrays is so much slower than expected,
although it's perhaps totally unrelated to what's discovered above.

Best regards,
Vadim

----------------------------------------------------------------------

For information about J forums see

http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------

For information about J forums see

http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see

http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm


----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Version 9 sorting anomaly? (was: Re: Unexpected results (a bug?) in version 9 with some cases of cut (subarray))

Reply via email to