Re: [Jprogramming] Unexpected results (a bug?) in version 9 with some cases of cut (subarray)

vadim Fri, 27 Jan 2023 03:07:59 -0800

> What result do you get when you use this syntax?

"Don't use master thread, use worker threads only", correct? So, I'm
allocating 3 workers for 4 equal pieces of work; 3 workers finish doing
their task each, then 2 of them are idle, and 1 alone does the 4th part?


Interestingly, with literals I get predictable time increase from ~400 ms
to ~650 ms. Which concurs, more or less. With integers, I get the same ~440
ms. How very strange. And it's the same ~420 ... 440 ms if I cut ints into
2 or 3 parts (master thread is idle then).

-------------
I don't know if "parallelizable primitives" (including "sort") will arrive
with 9.04 release, and maybe what follows was already mentioned in
community, but with sorting literals there's effective and very lazy way to
gain speed through using threads as they are implemented now:

(6!:2) '/:~ lits'

1.37036

(6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'

0.665364

1 T. ''

3
-------------
Henry, I only have shallow understanding, my practical point of view was
"try to divide work into preferably large and easy to manage parts, then
oversee the whole gang of workers are busy and none of them slacks". Thank
you very much for the detailed explanation, this cutting into very small
pieces is new to me, I'll now see if I can implement it in practice.
Sorting is only part of a larger whole; it's just that I noticed
interesting results while experimenting.


Best regards,
Vadim


On Fri, Jan 27, 2023 at 12:11 AM Ak O <[email protected]> wrote:

> Hi Vadim,
>
> What result do you get when you use this syntax?
>
>      ; cut_by /:~t.(<'worker';1);.0 ints
>
> Ak.
>
>
> On Thu., Jan. 26, 2023, 08:46 Henry Rich, <[email protected]> wrote:
>
> > 1. Make sure you turn off all browsers etc. when you time multrithreaded
> > code.
> >
> > 2. DO NOT give an x argument to 6!:2.  Repeated execution of the same
> > sentence may tend to migrate values into the core they will be used in,
> > which will underestimate the time required for a single run.  If you
> > have to time anything complicated you may have to put in your own timing
> > points.
> >
> > 3. 'Thread overhead' is more than just data movement.  It includes time
> > required to lock variable-name lookups and contention for D3$.  The sort
> > of 8-byte literals uses a merge sort IIRC, while the sort of integers
> > uses a lovingly-coded quicksort.  The quicksort is faster, as you can
> > see.  Mergesort is very cache-friendly: it reads and writes to memory
> > sequentially. Quicksort hops around, storing single words into two
> > different streams.  On a single thread that doesn't matter much, but
> > with multiple threads and arguments far larger than D3$, the writing
> > spills to DRAM: mergesort needs only one open DRAM page per thread while
> > quicksort needs two per thread.  That will have to make a difference;
> > how much I don't know.
> >
> > 4. If you want to split sorting into tasks, split it so that each task
> > fits into D2$ of a core.  With 10M numbers (80MB) and a machine with 1MB
> > D2$, let each task have 125K numbers.  Experiment with smaller blocks.
> > When you have the right size, the thread overhead will be comparable to
> > the time spent making two copies of the whole array.
> >
> > 5. If you find that you can reduce the overhead that far, we could have
> > a faster way to sort in J using multithreads: split the array into as
> > many pieces as needed and sort them efficiently in threads, then
> > transfer the data back to the master thread which merges the ordered
> > strings.  The transfer back to the master thread is unavoidable, but the
> > time required to merge (a reheaping operation) would be completely
> > hidden under the transfer time.
> >
> > 6. Currently, starting a thread realizes any virtual argument in the
> > master thread.  It would be better to realize it in the worker thread
> > where the data is needed.
> >
> > Henry Rich
> >
> > On 1/25/2023 6:11 PM, vadim wrote:
> > > Thank you for fixing the issue and for explanation. I would appreciate
> it
> > > if you could look into the performance issue I'm observing with my
> > > multi-threaded example (reduced to uselessness for sake of
> presentation,
> > > though). I will of course accept "that's how things are with
> CPUs/memory"
> > > and adjust expectations accordingly, and I understand results depend on
> > > hardware and dataset very much. Too small a chance it's J issue.
> > >
> > > The question concerns sorting data (I'm not doing the '+/' in threads
> > :)).
> > > I have 4 cores CPU, so I start 3 additional threads:
> > >
> > > {{ for. i. 3 do. 0 T. 0 end. }} ''
> > >
> > > Suppose I have literal data, 10 million rows 8 bytes each:
> > >
> > > lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> > > q =: >. -: -: # lits
> > > cut_by =: q ,:~ "0 q * i. 4
> > > quarter =: q {. lits
> > >
> > >     4 (6!:2) '/:~ lits'
> > > 1.37698
> > >     4 (6!:2) '/:~ quarter'
> > > 0.308181
> > >     4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> > > 0.414603
> > >
> > > Excellent, times I see match nicely, I understand there's overhead with
> > > threads. Next, there are 10 millions of 8-byte integers:
> > >
> > > ints =: ? (1 * 10^7) $ (10^18)
> > > q =: >. -: -: # ints
> > > cut_by =: q ,:~ "0 q * i. 4
> > > quarter =: q {. ints
> > >
> > >     4 (6!:2) '/:~ ints'
> > > 0.561057
> > >     4 (6!:2) '/:~ quarter'
> > > 0.124735
> > >     4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> > > 0.441807
> > >
> > > And I don't like this third time. There's roughly the same amount of
> data
> > > for "threads overhead". My expectation for this result was approx.
> > 150-200
> > > ms or so.
> > >
> > > In fact, looking at 415 ms for literals and 442 ms for numbers irked me
> > > into temptation:
> > >
> > > head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> > >
> > > _8 ]\ a. i. head
> > >
> > > 226 0 0 0 0 0 0 0
> > >
> > > 0 0 0 0 0 0 0 4
> > >
> > > 0 0 0 0 0 38 37 160
> > >
> > > 0 0 0 0 0 0 0 1
> > >
> > > 0 0 0 0 0 38 37 160
> > >
> > >
> > >
> > > to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> > >
> > > from_bytes =: (3!:2) @: (head&,) @: ,
> > >
> > >
> > > (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> > >
> > > 1
> > >
> > >
> > > 4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> > >
> > > 0.51177
> > >
> > >
> > > Ah, it didn't work. But perhaps it could with certain data, CPU model,
> > > number of cores? So my question is if you could confirm that it (slower
> > > than I expected speed with numerical sort in threads) is neither J
> issue,
> > > nor 't.', nor '/:~'. Sorry if I wasted your time.
> > >
> > > Best regards,
> > > Vadim
> > >
> > >
> > > On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <[email protected]>
> wrote:
> > >> Fixed for the release.  Thanks for the clear report.  The problem was
> > >> specific to the forms you mentioned. Workaround: use
> > >>
> > >> < @: ((+/) @:])   instead of   < @: (+/) @:]
> > >>
> > >> The form <@:f is given special treatment.  Your form was incorrectly
> > >> being given that treatment.
> > >>
> > >>
> > >> If t. in cut is not meeting your expectations, perhaps you should
> adjust
> > >> your expectations.  Verbs like (+/) will not benefit from threading in
> > >> most cases, and may slow down considerably.  +/;.0 might be even
> worse.
> > >>
> > >> Why?  Because +/ is totally limited by the speed of reading from
> > >> memory.  If the data fits in level-2 data cache (D2$) many cores are
> no
> > >> faster than one.
> > >>
> > >> In fact they are much slower, because only one core has the data in
> > >> D2$.  The rest have to transfer the data from the bottom of the ocean
> > >> (i. e. from the core with the data through D3$) or from the moon
> > >> (SDRAM).  They are spending their time waiting for responses from
> > memory.
> > >>
> > >> +/;.0 creates a virtual block for each section and passes that to +/ .
> > >> There is no need to move the data except for the reading required by
> +/
> > >> .  If you run the +/ in a thread, the virtual block must be realized
> > >> with an explicit copy from the bottom of the ocean.  That doesn't add
> > >> much, because once the copy is made the data will be in D2$ of the
> > >> receiving core, but it is a small slowdown.
> > >>
> > >> A thread needs to be able to run in its own core until it has done
> > >> reads+writes to D1$/D2$ at least, say, 100 times the size of its
> > >> arguments+result.  +/ . * is a perfect example.  On large matrices the
> > >> arguments are cycled through repeatedly.
> > >>
> > >> Henry Rich
> > >>
> > >> On 1/25/2023 7:08 AM, vadim wrote:
> > >>> Hi, please consider this:
> > >>>
> > >>>      ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> > >>> +---+---+
> > >>> |0 1|2 3|
> > >>> +---+---+
> > >>>      ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> > >>> +---+---+
> > >>> |0 1|2 3|
> > >>> +---+---+
> > >>>      ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> > >>> +---+---+
> > >>> |0 1|2 3|
> > >>> +---+---+
> > >>>
> > >>>
> > >>> No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and
> > 9.04.
> > >>> Looks like it happens if the left arg has multiple ranges; and a verb
> > to
> > >>> apply is composed with "same" and "box" verbs as first and last in
> > >>> sequence. But it's at 1st glance only. Sometimes omitting parentheses
> > > would
> > >>> help (which clearly means parsing issue?). All these produce expected
> > >>> output:
> > >>>
> > >>>      (2,:2) (< @: +: @: ]);.0 i. 4
> > >>> +---+
> > >>> |4 6|
> > >>> +---+
> > >>>      ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> > >>> 0 2
> > >>> 4 6
> > >>>      ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> > >>> +-+-+
> > >>> |1|5|
> > >>> +-+-+
> > >>>      (0,:2) (< @: (\:~) @: ]);.0 i. 4
> > >>> +---+
> > >>> |1 0|
> > >>> +---+
> > >>>      ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> > >>> 1 0
> > >>> 3 2
> > >>>      ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> > >>> +---+---+
> > >>> |1 0|3 2|
> > >>> +---+---+
> > >>>
> > >>>
> > >>> While "why would you want to use the ']' here?" would be reasonable
> to
> > > ask,
> > >>> but, in the end, syntax is either correct or not. In fact, I was
> > testing
> > >>> all kinds of different constructs investigating why multi-threaded
> > >>> operation (with "t.") on subarrays is so much slower than expected,
> > >>> although it's perhaps totally unrelated to what's discovered above.
> > >>>
> > >>> Best regards,
> > >>> Vadim
> > >>>
> ----------------------------------------------------------------------
> > >>> For information about J forums see
> http://www.jsoftware.com/forums.htm
> > >> ----------------------------------------------------------------------
> > >> For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Unexpected results (a bug?) in version 9 with some cases of cut (subarray)

Reply via email to