Thank you for fixing the issue and for explanation. I would appreciate it
if you could look into the performance issue I'm observing with my
multi-threaded example (reduced to uselessness for sake of presentation,
though). I will of course accept "that's how things are with CPUs/memory"
and adjust expectations accordingly, and I understand results depend on
hardware and dataset very much. Too small a chance it's J issue.

The question concerns sorting data (I'm not doing the '+/' in threads :)).
I have 4 cores CPU, so I start 3 additional threads:

{{ for. i. 3 do. 0 T. 0 end. }} ''

Suppose I have literal data, 10 million rows 8 bytes each:

lits =: a. {~ ? ((1 * 10^7), 8) $ 256
q =: >. -: -: # lits
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. lits

   4 (6!:2) '/:~ lits'
1.37698
   4 (6!:2) '/:~ quarter'
0.308181
   4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
0.414603

Excellent, times I see match nicely, I understand there's overhead with
threads. Next, there are 10 millions of 8-byte integers:

ints =: ? (1 * 10^7) $ (10^18)
q =: >. -: -: # ints
cut_by =: q ,:~ "0 q * i. 4
quarter =: q {. ints

   4 (6!:2) '/:~ ints'
0.561057
   4 (6!:2) '/:~ quarter'
0.124735
   4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
0.441807

And I don't like this third time. There's roughly the same amount of data
for "threads overhead". My expectation for this result was approx. 150-200
ms or so.

In fact, looking at 415 ms for literals and 442 ms for numbers irked me
into temptation:

head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q

_8 ]\ a. i. head

226 0 0 0 0 0 0 0

0 0 0 0 0 0 0 4

0 0 0 0 0 38 37 160

0 0 0 0 0 0 0 1

0 0 0 0 0 38 37 160



to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))

from_bytes =: (3!:2) @: (head&,) @: ,


(from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?

1


4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'

0.51177


Ah, it didn't work. But perhaps it could with certain data, CPU model,
number of cores? So my question is if you could confirm that it (slower
than I expected speed with numerical sort in threads) is neither J issue,
nor 't.', nor '/:~'. Sorry if I wasted your time.

Best regards,
Vadim


On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <[email protected]> wrote:
>
> Fixed for the release.  Thanks for the clear report.  The problem was
> specific to the forms you mentioned. Workaround: use
>
> < @: ((+/) @:])   instead of   < @: (+/) @:]
>
> The form <@:f is given special treatment.  Your form was incorrectly
> being given that treatment.
>
>
> If t. in cut is not meeting your expectations, perhaps you should adjust
> your expectations.  Verbs like (+/) will not benefit from threading in
> most cases, and may slow down considerably.  +/;.0 might be even worse.
>
> Why?  Because +/ is totally limited by the speed of reading from
> memory.  If the data fits in level-2 data cache (D2$) many cores are no
> faster than one.
>
> In fact they are much slower, because only one core has the data in
> D2$.  The rest have to transfer the data from the bottom of the ocean
> (i. e. from the core with the data through D3$) or from the moon
> (SDRAM).  They are spending their time waiting for responses from memory.
>
> +/;.0 creates a virtual block for each section and passes that to +/ .
> There is no need to move the data except for the reading required by +/
> .  If you run the +/ in a thread, the virtual block must be realized
> with an explicit copy from the bottom of the ocean.  That doesn't add
> much, because once the copy is made the data will be in D2$ of the
> receiving core, but it is a small slowdown.
>
> A thread needs to be able to run in its own core until it has done
> reads+writes to D1$/D2$ at least, say, 100 times the size of its
> arguments+result.  +/ . * is a perfect example.  On large matrices the
> arguments are cycled through repeatedly.
>
> Henry Rich
>
> On 1/25/2023 7:08 AM, vadim wrote:
> > Hi, please consider this:
> >
> >     ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> >     ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> >     ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> > +---+---+
> > |0 1|2 3|
> > +---+---+
> >
> >
> > No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and 9.04.
> > Looks like it happens if the left arg has multiple ranges; and a verb to
> > apply is composed with "same" and "box" verbs as first and last in
> > sequence. But it's at 1st glance only. Sometimes omitting parentheses
would
> > help (which clearly means parsing issue?). All these produce expected
> > output:
> >
> >     (2,:2) (< @: +: @: ]);.0 i. 4
> > +---+
> > |4 6|
> > +---+
> >     ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> > 0 2
> > 4 6
> >     ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> > +-+-+
> > |1|5|
> > +-+-+
> >     (0,:2) (< @: (\:~) @: ]);.0 i. 4
> > +---+
> > |1 0|
> > +---+
> >     ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> > 1 0
> > 3 2
> >     ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> > +---+---+
> > |1 0|3 2|
> > +---+---+
> >
> >
> > While "why would you want to use the ']' here?" would be reasonable to
ask,
> > but, in the end, syntax is either correct or not. In fact, I was testing
> > all kinds of different constructs investigating why multi-threaded
> > operation (with "t.") on subarrays is so much slower than expected,
> > although it's perhaps totally unrelated to what's discovered above.
> >
> > Best regards,
> > Vadim
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to