So I tried to start implementing "split into small sub-tasks", but got
stuck before any meaningful step, at what may be another version 9 issue (?)

Though I remember "DO NOT give an x argument to 6!:2", the tests below use
an x argument to catch a solid time. Otherwise I could cut 10^7 numbers
into ~100 ranges and sort each (closer to my real task); then there would
be no x argument, and roughly the same "solid" time difference would be
caught (with the decimal point moved two places to the right, of course),
but the cutting would mask the real issue.

   100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 100000
0.00263482
   100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 99999
0.0073261
   100 (6!:2) '}./:~ nn,_1' [ nn =: ? N $ 10^9 [ N =: 99999
0.00298486

If I reboot into Linux, the difference between tests 1 and 2 is not 3x but
4x. Test 3 amounts to the ridiculous advice "don't sort 99999 (nor 99000,
perhaps) numbers; instead pad with throw-away values up to 10^5 numbers". I
hope my result is not CPU-model specific, and that you can observe
something similar.
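
Just to make the cutting variant mentioned above concrete, here is a rough
sketch (the round range count and the names are only for illustration, not
my real code):

   NB. sketch: 100 equal ranges of nn, each sorted on its own, no x to 6!:2
   nn     =: ? (10^7) $ 10^9
   len    =: <. (# nn) % 100            NB. 100000 numbers per range
   ranges =: (len * i. 100) ,. len      NB. (start,length) rows for ;.0
   (6!:2) 'ranges /:~;.0 nn'            NB. ranges are not merged, so this masks the issue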

No issue in 8.07.

Best regards,
Vadim


On Fri, Jan 27, 2023 at 6:18 PM Henry Rich <[email protected]> wrote:

> J9.04 is frozen for the release, but J9.05 will follow.  This thread has
> convinced me that sorting large arrays can be greatly improved by
> multithreading, and also that I can reduce the threading overhead.
>
> Henry Rich
>
> On 1/27/2023 6:07 AM, vadim wrote:
> >> What result do you get when you use this syntax?
> > "Don't use master thread, use worker threads only", correct? So, I'm
> > allocating 3 workers for 4 equal pieces of work; 3 workers finish doing
> > their task each, then 2 of them are idle, and 1 alone does the 4th part?
> >
> > Interestingly, with literals I get a predictable time increase from ~400 ms
> > to ~650 ms, which concurs, more or less. With integers, I get the same ~440
> > ms. How very strange. And it's the same ~420 ... 440 ms if I cut ints into
> > 2 or 3 parts (the master thread is idle then).
> >
> > -------------
> > I don't know if "parallelizable primitives" (including "sort") will arrive
> > with the 9.04 release, and maybe what follows was already mentioned in the
> > community, but with sorting literals there's an effective and very lazy way
> > to gain speed with threads as they are implemented now:
> >
> >    (6!:2) '/:~ lits'
> > 1.37036
> >    (6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'
> > 0.665364
> >    1 T. ''
> > 3
> > -------------
> > Henry, I only have a shallow understanding; my practical point of view was
> > "try to divide work into preferably large and easy-to-manage parts, then see
> > that the whole gang of workers stays busy and none of them slacks". Thank
> > you very much for the detailed explanation; this cutting into very small
> > pieces is new to me, and I'll now see if I can implement it in practice.
> > Sorting is only part of a larger whole; it's just that I noticed
> > interesting results while experimenting.
> >
> >
> > Best regards,
> > Vadim
> >
> >
> > On Fri, Jan 27, 2023 at 12:11 AM Ak O <[email protected]> wrote:
> >
> >> Hi Vadim,
> >>
> >> What result do you get when you use this syntax?
> >>
> >>       ; cut_by /:~t.(<'worker';1);.0 ints
> >>
> >> Ak.
> >>
> >>
> >> On Thu., Jan. 26, 2023, 08:46 Henry Rich, <[email protected]> wrote:
> >>
> >>> 1. Make sure you turn off all browsers etc. when you time multithreaded
> >>> code.
> >>>
> >>> 2. DO NOT give an x argument to 6!:2.  Repeated execution of the same
> >>> sentence may tend to migrate values into the core they will be used in,
> >>> which will underestimate the time required for a single run.  If you
> >>> have to time anything complicated you may have to put in your own timing
> >>> points.
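> >>>
> >>> (A minimal sketch of such hand-rolled timing points, assuming the
> >>> resolution of 6!:1 session time is adequate; the names are placeholders:)
> >>>    t0 =. 6!:1 ''
> >>>    r  =. /:~ ints                       NB. the step being measured, run once
> >>>    echo 'elapsed: ' , ": (6!:1 '') - t0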
> >>>
> >>> 3. 'Thread overhead' is more than just data movement.  It includes time
> >>> required to lock variable-name lookups and contention for D3$.  The sort
> >>> of 8-byte literals uses a merge sort IIRC, while the sort of integers
> >>> uses a lovingly-coded quicksort.  The quicksort is faster, as you can
> >>> see.  Mergesort is very cache-friendly: it reads and writes to memory
> >>> sequentially.  Quicksort hops around, storing single words into two
> >>> different streams.  On a single thread that doesn't matter much, but
> >>> with multiple threads and arguments far larger than D3$, the writing
> >>> spills to DRAM: mergesort needs only one open DRAM page per thread while
> >>> quicksort needs two per thread.  That will have to make a difference;
> >>> how much I don't know.
> >>>
> >>> 4. If you want to split sorting into tasks, split it so that each task
> >>> fits into D2$ of a core.  With 10M numbers (80MB) and a machine with 1MB
> >>> D2$, let each task have 125K numbers.  Experiment with smaller blocks.
> >>> When you have the right size, the thread overhead will be comparable to
> >>> the time spent making two copies of the whole array.
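> >>>
> >>> (The sizing arithmetic above, spelled out as a sketch; the 1MB D2$ and
> >>> 8-byte items are the assumptions from that paragraph, and 131072 vs 125K
> >>> is just rounding:)
> >>>    d2size =. 2^20                  NB. assumed D2$ per core, in bytes
> >>>    isize  =. 8                     NB. bytes per integer
> >>>    blk    =. <. d2size % isize     NB. 131072 items; ~125K as a round figure
> >>>    ntasks =. >. (10^7) % blk       NB. about 77 tasks for 10M numbers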
> >>>
> >>> 5. If you find that you can reduce the overhead that far, we could have
> >>> a faster way to sort in J using multithreads: split the array into as
> >>> many pieces as needed and sort them efficiently in threads, then
> >>> transfer the data back to the master thread which merges the ordered
> >>> strings.  The transfer back to the master thread is unavoidable, but the
> >>> time required to merge (a reheaping operation) would be completely
> >>> hidden under the transfer time.
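> >>>
> >>> (For the "merge the ordered strings" step, a toy two-way merge sketch,
> >>> nothing like the reheaping merge described above and far from efficient,
> >>> just to show the shape of the operation:)
> >>>    merge2 =. dyad define
> >>>    NB. merge two ascending lists; e.g. 1 4 9 merge2 2 3 10  ->  1 2 3 4 9 10
> >>>    r =. 0 $ x
> >>>    while. 0 < (# x) <. # y do.
> >>>      if. ({. x) <: {. y do.
> >>>        r =. r , {. x
> >>>        x =. }. x
> >>>      else.
> >>>        r =. r , {. y
> >>>        y =. }. y
> >>>      end.
> >>>    end.
> >>>    r , x , y
> >>> )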
> >>>
> >>> 6. Currently, starting a thread realizes any virtual argument in the
> >>> master thread.  It would be better to realize it in the worker thread
> >>> where the data is needed.
> >>>
> >>> Henry Rich
> >>>
> >>> On 1/25/2023 6:11 PM, vadim wrote:
> >>>> Thank you for fixing the issue and for the explanation. I would appreciate
> >>>> it if you could look into the performance issue I'm observing with my
> >>>> multi-threaded example (reduced to uselessness for the sake of
> >>>> presentation, though). I will of course accept "that's how things are with
> >>>> CPUs/memory" and adjust expectations accordingly, and I understand the
> >>>> results depend on hardware and the dataset very much. Too small a chance
> >>>> it's a J issue.
> >>>>
> >>>> The question concerns sorting data (I'm not doing the '+/' in threads :)).
> >>>> I have a 4-core CPU, so I start 3 additional threads:
> >>>>
> >>>> {{ for. i. 3 do. 0 T. 0 end. }} ''
> >>>>
> >>>> Suppose I have literal data, 10 million rows of 8 bytes each:
> >>>>
> >>>> lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> >>>> q =: >. -: -: # lits
> >>>> cut_by =: q ,:~ "0 q * i. 4
> >>>> quarter =: q {. lits
> >>>>
> >>>>      4 (6!:2) '/:~ lits'
> >>>> 1.37698
> >>>>      4 (6!:2) '/:~ quarter'
> >>>> 0.308181
> >>>>      4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> >>>> 0.414603
> >>>>
> >>>> Excellent, the times I see match nicely; I understand there's overhead
> >>>> with threads. Next, 10 million 8-byte integers:
> >>>>
> >>>> ints =: ? (1 * 10^7) $ (10^18)
> >>>> q =: >. -: -: # ints
> >>>> cut_by =: q ,:~ "0 q * i. 4
> >>>> quarter =: q {. ints
> >>>>
> >>>>      4 (6!:2) '/:~ ints'
> >>>> 0.561057
> >>>>      4 (6!:2) '/:~ quarter'
> >>>> 0.124735
> >>>>      4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> >>>> 0.441807
> >>>>
> >>>> And I don't like this third time. There's roughly the same amount of data
> >>>> for "thread overhead" as with the literals. My expectation for this result
> >>>> was approx. 150-200 ms or so (one quarter takes ~125 ms, the four quarters
> >>>> run more or less in parallel, plus some overhead).
> >>>>
> >>>> In fact, looking at 415 ms for literals and 442 ms for numbers irked me
> >>>> into temptation:
> >>>>
> >>>> head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> >>>>
> >>>>    _8 ]\ a. i. head
> >>>> 226 0 0 0 0  0  0   0
> >>>>   0 0 0 0 0  0  0   4
> >>>>   0 0 0 0 0 38 37 160
> >>>>   0 0 0 0 0  0  0   1
> >>>>   0 0 0 0 0 38 37 160
> >>>>
> >>>> to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> >>>> from_bytes =: (3!:2) @: (head&,) @: ,
> >>>>
> >>>>    (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> >>>> 1
> >>>>
> >>>>    4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> >>>> 0.51177
> >>>>
> >>>> Ah, it didn't work. But perhaps it could with certain data, a certain CPU
> >>>> model, or number of cores? So my question is whether you could confirm
> >>>> that it (the slower-than-expected speed of numerical sort in threads) is
> >>>> neither a J issue, nor 't.', nor '/:~'. Sorry if I wasted your time.
> >>>>
> >>>> Best regards,
> >>>> Vadim
> >>>>
> >>>>
> >>>> On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <[email protected]> wrote:
> >>>>> Fixed for the release.  Thanks for the clear report.  The problem was
> >>>>> specific to the forms you mentioned. Workaround: use
> >>>>>
> >>>>> < @: ((+/) @:])   instead of   < @: (+/) @:]
> >>>>>
> >>>>> The form <@:f is given special treatment.  Your form was incorrectly
> >>>>> being given that treatment.
> >>>>>
> >>>>>
> >>>>> If t. in cut is not meeting your expectations, perhaps you should adjust
> >>>>> your expectations.  Verbs like (+/) will not benefit from threading in
> >>>>> most cases, and may slow down considerably.  +/;.0 might be even worse.
> >>>>> Why?  Because +/ is totally limited by the speed of reading from
> >>>>> memory.  If the data fits in level-2 data cache (D2$) many cores are no
> >>>>> faster than one.
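> >>>>>
> >>>>> (A sketch of the comparison implied here, reusing the cut_by and ints
> >>>>> definitions that appear above in this thread; the actual numbers are
> >>>>> machine-dependent:)
> >>>>>    (6!:2) '+/ ints'                            NB. plain single-threaded sum
> >>>>>    (6!:2) '+/ ; cut_by (+/)t.(0$0);.0 ints'    NB. per-piece sums in threads, then combined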
> >>>>>
> >>>>> In fact they are much slower, because only one core has the data in
> >>>>> D2$.  The rest have to transfer the data from the bottom of the ocean
> >>>>> (i. e. from the core with the data through D3$) or from the moon
> >>>>> (SDRAM).  They are spending their time waiting for responses from memory.
> >>>>>
> >>>>> +/;.0 creates a virtual block for each section and passes that to +/ .
> >>>>> There is no need to move the data except for the reading required by +/ .
> >>>>> If you run the +/ in a thread, the virtual block must be realized
> >>>>> with an explicit copy from the bottom of the ocean.  That doesn't add
> >>>>> much, because once the copy is made the data will be in D2$ of the
> >>>>> receiving core, but it is a small slowdown.
> >>>>>
> >>>>> A thread needs to be able to run in its own core until it has done
> >>>>> reads+writes to D1$/D2$ at least, say, 100 times the size of its
> >>>>> arguments+result.  +/ . * is a perfect example.  On large matrices the
> >>>>> arguments are cycled through repeatedly.
> >>>>>
> >>>>> Henry Rich
> >>>>>
> >>>>> On 1/25/2023 7:08 AM, vadim wrote:
> >>>>>> Hi, please consider this:
> >>>>>>
> >>>>>>       ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>       ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>       ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>
> >>>>>>
> >>>>>> No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and 9.04.
> >>>>>> Looks like it happens if the left arg has multiple ranges and the verb to
> >>>>>> apply is composed with "same" and "box" verbs as the first and last in
> >>>>>> the sequence. But that's at 1st glance only. Sometimes omitting
> >>>>>> parentheses would help (which clearly means a parsing issue?). All these
> >>>>>> produce the expected output:
> >>>>>>
> >>>>>>       (2,:2) (< @: +: @: ]);.0 i. 4
> >>>>>> +---+
> >>>>>> |4 6|
> >>>>>> +---+
> >>>>>>       ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> >>>>>> 0 2
> >>>>>> 4 6
> >>>>>>       ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> >>>>>> +-+-+
> >>>>>> |1|5|
> >>>>>> +-+-+
> >>>>>>       (0,:2) (< @: (\:~) @: ]);.0 i. 4
> >>>>>> +---+
> >>>>>> |1 0|
> >>>>>> +---+
> >>>>>>       ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> >>>>>> 1 0
> >>>>>> 3 2
> >>>>>>       ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> >>>>>> +---+---+
> >>>>>> |1 0|3 2|
> >>>>>> +---+---+
> >>>>>>
> >>>>>>
> >>>>>> While "why would you want to use the ']' here?" would be reasonable
> >> to
> >>>> ask,
> >>>>>> but, in the end, syntax is either correct or not. In fact, I was
> >>> testing
> >>>>>> all kinds of different constructs investigating why multi-threaded
> >>>>>> operation (with "t.") on subarrays is so much slower than expected,
> >>>>>> although it's perhaps totally unrelated to what's discovered above.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Vadim
> >>>>>>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
