Thank you for fixing the issue and for the explanation. I would
appreciate it if you could look into the performance issue I'm observing
with my multi-threaded example (reduced to uselessness for the sake of
presentation, though). I will of course accept "that's how things are
with CPUs/memory" and adjust expectations accordingly, and I understand
that results depend very much on hardware and dataset. The chance that
it's a J issue is small.
The question concerns sorting data (I'm not doing the '+/' in threads :)).
I have a 4-core CPU, so I start 3 additional threads:
{{ for. i. 3 do. 0 T. 0 end. }} ''
Suppose I have literal data, 10 million rows of 8 bytes each:
lits =: a. {~ ? ((1 * 10^7), 8) $ 256
q =: >. -: -: # lits
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. lits
4 (6!:2) '/:~ lits'
1.37698
4 (6!:2) '/:~ quarter'
0.308181
4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
0.414603
Excellent, the times I see match nicely, and I understand there's
overhead with threads. Next, 10 million 8-byte integers:
ints =: ? (1 * 10^7) $ (10^18)
q =: >. -: -: # ints
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. ints
4 (6!:2) '/:~ ints'
0.561057
4 (6!:2) '/:~ quarter'
0.124735
4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
0.441807
And I don't like this third time. The amount of data is roughly the same
as in the literal case, so the "threads overhead" should be similar. My
expectation for this result was approx. 150-200 ms.
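As a sanity check on the partitioning itself, here is a minimal sketch of
what cut_by describes, scaled down to an 8-item list. The names q_demo and
cut_demo are hypothetical and used only for this illustration; it says
nothing about timings.

```j
NB. Hypothetical small-scale check; q_demo and cut_demo are illustration-only names.
q_demo   =: >. -: -: 8                     NB. quarter length of an 8-item list: 2
cut_demo =: q_demo ,:~ "0 q_demo * i. 4    NB. same construction as cut_by above
$ cut_demo                                 NB. 4 2 1: four (offset ,: length) tables
cut_demo <;.0 i. 8                         NB. boxes the four pieces 0 1, 2 3, 4 5, 6 7
```

Each 2-by-1 table gives a starting offset and a length, so the four pieces
are contiguous and do not overlap.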
In fact, seeing 415 ms for literals and 442 ms for numbers tempted me to
try sorting the integers as 8-byte literal rows:
head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
_8 ]\ a. i. head
226 0 0 0 0 0 0 0
0 0 0 0 0 0 0 4
0 0 0 0 0 38 37 160
0 0 0 0 0 0 0 1
0 0 0 0 0 38 37 160
to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
from_bytes =: (3!:2) @: (head&,) @: ,
(from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
1
4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
0.51177
Ah, it didn't work. But perhaps it could with certain data, CPU model,
number of cores? So my question is whether you could confirm that the
slower-than-expected numerical sort in threads is not an issue with J,
't.', or '/:~'. Sorry if I wasted your time.
Best regards,
Vadim
On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <[email protected]> wrote:
>
> Fixed for the release. Thanks for the clear report. The problem was
> specific to the forms you mentioned. Workaround: use
>
> < @: ((+/) @:]) instead of < @: (+/) @:]
>
> The form <@:f is given special treatment. Your form was incorrectly
> being given that treatment.
>
>
> If t. in cut is not meeting your expectations, perhaps you should adjust
> your expectations. Verbs like (+/) will not benefit from threading in
> most cases, and may slow down considerably. +/;.0 might be even worse.
>
> Why? Because +/ is totally limited by the speed of reading from
> memory. If the data fits in level-2 data cache (D2$) many cores are no
> faster than one.
>
> In fact they are much slower, because only one core has the data in
> D2$. The rest have to transfer the data from the bottom of the ocean
> (i. e. from the core with the data through D3$) or from the moon
> (SDRAM). They are spending their time waiting for responses from memory.
>
> +/;.0 creates a virtual block for each section and passes that to +/ .
> There is no need to move the data except for the reading required by +/
> . If you run the +/ in a thread, the virtual block must be realized
> with an explicit copy from the bottom of the ocean. That doesn't add
> much, because once the copy is made the data will be in D2$ of the
> receiving core, but it is a small slowdown.
>
> A thread needs to be able to run in its own core until it has done
> reads+writes to D1$/D2$ at least, say, 100 times the size of its
> arguments+result. +/ . * is a perfect example. On large matrices the
> arguments are cycled through repeatedly.
>
> Henry Rich
>
> On 1/25/2023 7:08 AM, vadim wrote:
> > Hi, please consider this:
> >
> > ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> > ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> > ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> >
> >
> > No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and 9.04.
> > Looks like it happens if the left arg has multiple ranges; and a verb to
> > apply is composed with "same" and "box" verbs as first and last in
> > sequence. But it's at 1st glance only. Sometimes omitting parentheses
> > would help (which clearly means parsing issue?). All these produce
> > expected output:
> >
> > (2,:2) (< @: +: @: ]);.0 i. 4
> > +---+
> > |4 6|
> > +---+
> > ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> > 0 2
> > 4 6
> > ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> > +-+-+
> > |1|5|
> > +-+-+
> > (0,:2) (< @: (\:~) @: ]);.0 i. 4
> > +---+
> > |1 0|
> > +---+
> > ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> > 1 0
> > 3 2
> > ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> > +---+---+
> > |1 0|3 2|
> > +---+---+
> >
> >
> > While "why would you want to use the ']' here?" would be reasonable to
> > ask, but, in the end, syntax is either correct or not. In fact, I was
> > testing
> > all kinds of different constructs investigating why multi-threaded
> > operation (with "t.") on subarrays is so much slower than expected,
> > although it's perhaps totally unrelated to what's discovered above.
> >
> > Best regards,
> > Vadim
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm