Hi Vadim,

What result do you get when you use this syntax?

     ; cut_by /:~t.(<'worker';1);.0 ints

Ak.


On Thu., Jan. 26, 2023, 08:46 Henry Rich, <henryhr...@gmail.com> wrote:

> 1. Make sure you turn off all browsers etc. when you time multrithreaded
> code.
>
> 2. DO NOT give an x argument to 6!:2.  Repeated execution of the same
> sentence may tend to migrate values into the core they will be used in,
> which will underestimate the time required for a single run.  If you
> have to time anything complicated you may have to put in your own timing
> points.
>
> 3. 'Thread overhead' is more than just data movement.  It includes time
> required to lock variable-name lookups and contention for D3$.  The sort
> of 8-byte literals uses a merge sort IIRC, while the sort of integers
> uses a lovingly-coded quicksort.  The quicksort is faster, as you can
> see.  Mergesort is very cache-friendly: it reads and writes to memory
> sequentially. Quicksort hops around, storing single words into two
> different streams.  On a single thread that doesn't matter much, but
> with multiple threads and arguments far larger than D3$, the writing
> spills to DRAM: mergesort needs only one open DRAM page per thread while
> quicksort needs two per thread.  That will have to make a difference;
> how much I don't know.
>
> 4. If you want to split sorting into tasks, split it so that each task
> fits into D2$ of a core.  With 10M numbers (80MB) and a machine with 1MB
> D2$, let each task have 125K numbers.  Experiment with smaller blocks.
> When you have the right size, the thread overhead will be comparable to
> the time spent making two copies of the whole array.
>
> 5. If you find that you can reduce the overhead that far, we could have
> a faster way to sort in J using multithreads: split the array into as
> many pieces as needed and sort them efficiently in threads, then
> transfer the data back to the master thread which merges the ordered
> strings.  The transfer back to the master thread is unavoidable, but the
> time required to merge (a reheaping operation) would be completely
> hidden under the transfer time.
>
> 6. Currently, starting a thread realizes any virtual argument in the
> master thread.  It would be better to realize it in the worker thread
> where the data is needed.
>
> Henry Rich
>
> On 1/25/2023 6:11 PM, vadim wrote:
> > Thank you for fixing the issue and for explanation. I would appreciate it
> > if you could look into the performance issue I'm observing with my
> > multi-threaded example (reduced to uselessness for sake of presentation,
> > though). I will of course accept "that's how things are with CPUs/memory"
> > and adjust expectations accordingly, and I understand results depend on
> > hardware and dataset very much. Too small a chance it's J issue.
> >
> > The question concerns sorting data (I'm not doing the '+/' in threads
> :)).
> > I have 4 cores CPU, so I start 3 additional threads:
> >
> > {{ for. i. 3 do. 0 T. 0 end. }} ''
> >
> > Suppose I have literal data, 10 million rows 8 bytes each:
> >
> > lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> > q =: >. -: -: # lits
> > cut_by =: q ,:~ "0 q * i. 4
> > quarter =: q {. lits
> >
> >     4 (6!:2) '/:~ lits'
> > 1.37698
> >     4 (6!:2) '/:~ quarter'
> > 0.308181
> >     4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> > 0.414603
> >
> > Excellent, times I see match nicely, I understand there's overhead with
> > threads. Next, there are 10 millions of 8-byte integers:
> >
> > ints =: ? (1 * 10^7) $ (10^18)
> > q =: >. -: -: # ints
> > cut_by =: q ,:~ "0 q * i. 4
> > quarter =: q {. ints
> >
> >     4 (6!:2) '/:~ ints'
> > 0.561057
> >     4 (6!:2) '/:~ quarter'
> > 0.124735
> >     4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> > 0.441807
> >
> > And I don't like this third time. There's roughly the same amount of data
> > for "threads overhead". My expectation for this result was approx.
> 150-200
> > ms or so.
> >
> > In fact, looking at 415 ms for literals and 442 ms for numbers irked me
> > into temptation:
> >
> > head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> >
> > _8 ]\ a. i. head
> >
> > 226 0 0 0 0 0 0 0
> >
> > 0 0 0 0 0 0 0 4
> >
> > 0 0 0 0 0 38 37 160
> >
> > 0 0 0 0 0 0 0 1
> >
> > 0 0 0 0 0 38 37 160
> >
> >
> >
> > to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> >
> > from_bytes =: (3!:2) @: (head&,) @: ,
> >
> >
> > (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> >
> > 1
> >
> >
> > 4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> >
> > 0.51177
> >
> >
> > Ah, it didn't work. But perhaps it could with certain data, CPU model,
> > number of cores? So my question is if you could confirm that it (slower
> > than I expected speed with numerical sort in threads) is neither J issue,
> > nor 't.', nor '/:~'. Sorry if I wasted your time.
> >
> > Best regards,
> > Vadim
> >
> >
> > On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <henryhr...@gmail.com> wrote:
> >> Fixed for the release.  Thanks for the clear report.  The problem was
> >> specific to the forms you mentioned. Workaround: use
> >>
> >> < @: ((+/) @:])   instead of   < @: (+/) @:]
> >>
> >> The form <@:f is given special treatment.  Your form was incorrectly
> >> being given that treatment.
> >>
> >>
> >> If t. in cut is not meeting your expectations, perhaps you should adjust
> >> your expectations.  Verbs like (+/) will not benefit from threading in
> >> most cases, and may slow down considerably.  +/;.0 might be even worse.
> >>
> >> Why?  Because +/ is totally limited by the speed of reading from
> >> memory.  If the data fits in level-2 data cache (D2$) many cores are no
> >> faster than one.
> >>
> >> In fact they are much slower, because only one core has the data in
> >> D2$.  The rest have to transfer the data from the bottom of the ocean
> >> (i. e. from the core with the data through D3$) or from the moon
> >> (SDRAM).  They are spending their time waiting for responses from
> memory.
> >>
> >> +/;.0 creates a virtual block for each section and passes that to +/ .
> >> There is no need to move the data except for the reading required by +/
> >> .  If you run the +/ in a thread, the virtual block must be realized
> >> with an explicit copy from the bottom of the ocean.  That doesn't add
> >> much, because once the copy is made the data will be in D2$ of the
> >> receiving core, but it is a small slowdown.
> >>
> >> A thread needs to be able to run in its own core until it has done
> >> reads+writes to D1$/D2$ at least, say, 100 times the size of its
> >> arguments+result.  +/ . * is a perfect example.  On large matrices the
> >> arguments are cycled through repeatedly.
> >>
> >> Henry Rich
> >>
> >> On 1/25/2023 7:08 AM, vadim wrote:
> >>> Hi, please consider this:
> >>>
> >>>      ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>      ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>      ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>
> >>>
> >>> No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and
> 9.04.
> >>> Looks like it happens if the left arg has multiple ranges; and a verb
> to
> >>> apply is composed with "same" and "box" verbs as first and last in
> >>> sequence. But it's at 1st glance only. Sometimes omitting parentheses
> > would
> >>> help (which clearly means parsing issue?). All these produce expected
> >>> output:
> >>>
> >>>      (2,:2) (< @: +: @: ]);.0 i. 4
> >>> +---+
> >>> |4 6|
> >>> +---+
> >>>      ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> >>> 0 2
> >>> 4 6
> >>>      ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> >>> +-+-+
> >>> |1|5|
> >>> +-+-+
> >>>      (0,:2) (< @: (\:~) @: ]);.0 i. 4
> >>> +---+
> >>> |1 0|
> >>> +---+
> >>>      ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> >>> 1 0
> >>> 3 2
> >>>      ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> >>> +---+---+
> >>> |1 0|3 2|
> >>> +---+---+
> >>>
> >>>
> >>> While "why would you want to use the ']' here?" would be reasonable to
> > ask,
> >>> but, in the end, syntax is either correct or not. In fact, I was
> testing
> >>> all kinds of different constructs investigating why multi-threaded
> >>> operation (with "t.") on subarrays is so much slower than expected,
> >>> although it's perhaps totally unrelated to what's discovered above.
> >>>
> >>> Best regards,
> >>> Vadim
> >>> ----------------------------------------------------------------------
> >>> For information about J forums see http://www.jsoftware.com/forums.htm
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to