So I started to implement "split into small sub-tasks", but got stuck before any meaningful step, at what may be another version 9 issue (?)
Though I remember "DO NOT give an x argument to 6!:2", in the tests below there is an x argument, to catch solid time. Otherwise I could cut 10^7 numbers into ~100 ranges and sort each (closer to my real task); then there would be no x argument, and approximately the same "solid" time difference would be caught (with the decimal point moved 2 places to the right, of course), but the cutting would mask the real issue.

   100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 100000
0.00263482
   100 (6!:2) '/:~ nn' [ nn =: ? N $ 10^9 [ N =: 99999
0.0073261
   100 (6!:2) '}./:~ nn,_1' [ nn =: ? N $ 10^9 [ N =: 99999
0.00298486

If I reboot into Linux, it's not a 3x but a 4x difference between tests 1 and 2. Test 3 amounts to the ridiculous advice "don't sort 99999 (nor 99000, perhaps) numbers; instead pad with throw-away values up to 10^5 numbers". I hope my result is not CPU-model specific and you can observe something similar. No issue in 8.07.

Best regards,
Vadim
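[A minimal sketch of the cutting described above but not shown: 10^7 numbers split into 100 ranges of 10^5 and each range sorted separately, timed with a single run of 6!:2 and no x argument. The names nn, q and cut_by mirror later examples in this thread and are illustrative only; no timing figures are claimed.]

   nn =: ? (10^7) $ 10^9
   q =: 10^5                       NB. 100 ranges of 100000 numbers each
   cut_by =: q ,:~ "0 q * i. 100   NB. (start,length) pairs for ;.0
   (6!:2) '/:~ nn'                 NB. single run: sort the whole array
   (6!:2) 'cut_by /:~;.0 nn'       NB. single run: sort each range separately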
On Fri, Jan 27, 2023 at 6:18 PM Henry Rich <[email protected]> wrote:

> J9.04 is frozen for the release, but J9.05 will follow. This thread has convinced me that sorting large arrays can be greatly improved by multithreading, and also that I can reduce the threading overhead.
>
> Henry Rich
>
> On 1/27/2023 6:07 AM, vadim wrote:
> >> What result do you get when you use this syntax?
> > "Don't use the master thread, use worker threads only", correct? So I'm allocating 3 workers for 4 equal pieces of work; the 3 workers each finish their task, then 2 of them are idle and 1 alone does the 4th part?
> >
> > Interestingly, with literals I get a predictable time increase from ~400 ms to ~650 ms, which concurs, more or less. With integers, I get the same ~440 ms. How very strange. And it's the same ~420 ... 440 ms if I cut ints into 2 or 3 parts (the master thread is idle then).
> >
> > -------------
> > I don't know if "parallelizable primitives" (including "sort") will arrive with the 9.04 release, and maybe what follows was already mentioned in the community, but with sorting literals there's an effective and very lazy way to gain speed by using threads as they are implemented now:
> >
> >    (6!:2) '/:~ lits'
> > 1.37036
> >    (6!:2) '/:~ ;cut_by /:~t.(0$0);.0 lits'
> > 0.665364
> >    1 T. ''
> > 3
> > -------------
> > Henry, I only have a shallow understanding; my practical point of view was "try to divide the work into preferably large and easy-to-manage parts, then see that the whole gang of workers is busy and none of them slacks". Thank you very much for the detailed explanation; this cutting into very small pieces is new to me, and I'll now see if I can implement it in practice. Sorting is only part of a larger whole; it's just that I noticed interesting results while experimenting.
> >
> > Best regards,
> > Vadim
> >
> > On Fri, Jan 27, 2023 at 12:11 AM Ak O <[email protected]> wrote:
> >
> >> Hi Vadim,
> >>
> >> What result do you get when you use this syntax?
> >>
> >>    ; cut_by /:~t.(<'worker';1);.0 ints
> >>
> >> Ak.
> >>
> >> On Thu., Jan. 26, 2023, 08:46 Henry Rich, <[email protected]> wrote:
> >>
> >>> 1. Make sure you turn off all browsers etc. when you time multithreaded code.
> >>>
> >>> 2. DO NOT give an x argument to 6!:2. Repeated execution of the same sentence may tend to migrate values into the core they will be used in, which will underestimate the time required for a single run. If you have to time anything complicated you may have to put in your own timing points.
> >>>
> >>> 3. 'Thread overhead' is more than just data movement. It includes time required to lock variable-name lookups and contention for D3$. The sort of 8-byte literals uses a merge sort IIRC, while the sort of integers uses a lovingly-coded quicksort. The quicksort is faster, as you can see. Mergesort is very cache-friendly: it reads and writes to memory sequentially. Quicksort hops around, storing single words into two different streams. On a single thread that doesn't matter much, but with multiple threads and arguments far larger than D3$, the writing spills to DRAM: mergesort needs only one open DRAM page per thread while quicksort needs two per thread. That will have to make a difference; how much I don't know.
> >>>
> >>> 4. If you want to split sorting into tasks, split it so that each task fits into the D2$ of a core. With 10M numbers (80MB) and a machine with 1MB D2$, let each task have 125K numbers. Experiment with smaller blocks. When you have the right size, the thread overhead will be comparable to the time spent making two copies of the whole array.
> >>>
> >>> 5. If you find that you can reduce the overhead that far, we could have a faster way to sort in J using multithreads: split the array into as many pieces as needed and sort them efficiently in threads, then transfer the data back to the master thread, which merges the ordered strings. The transfer back to the master thread is unavoidable, but the time required to merge (a reheaping operation) would be completely hidden under the transfer time.
> >>>
> >>> 6. Currently, starting a thread realizes any virtual argument in the master thread. It would be better to realize it in the worker thread where the data is needed.
> >>>
> >>> Henry Rich
> >>>
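[A minimal sketch of points 4-5 above, under stated assumptions: a 4-core machine with 3 worker threads started as in Vadim's earlier example, roughly 1MB of D2$ per core (the example figures above, hence 125K integers per task), and a plain /:~ of the razed result standing in for the merge of ordered pieces described in point 5. The names nn, blk and cuts are illustrative only.]

   {{ for. i. 3 do. 0 T. 0 end. }} ''       NB. start 3 worker threads
   nn =: ? (10^7) $ 10^9
   blk =: 125000                            NB. ~1MB of 8-byte numbers per task
   cuts =: blk ,:~ "0 blk * i. 80           NB. 80 tasks of 125000 numbers each
   sorted =: /:~ ; cuts /:~t.(0$0);.0 nn    NB. sort pieces in workers, combine in master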
> >>> On 1/25/2023 6:11 PM, vadim wrote:
> >>>> Thank you for fixing the issue and for the explanation. I would appreciate it if you could look into the performance issue I'm observing with my multi-threaded example (reduced to uselessness for the sake of presentation, though). I will of course accept "that's how things are with CPUs/memory" and adjust expectations accordingly, and I understand results depend very much on hardware and dataset. Too small a chance it's a J issue.
> >>>>
> >>>> The question concerns sorting data (I'm not doing the '+/' in threads :)). I have a 4-core CPU, so I start 3 additional threads:
> >>>>
> >>>>    {{ for. i. 3 do. 0 T. 0 end. }} ''
> >>>>
> >>>> Suppose I have literal data, 10 million rows of 8 bytes each:
> >>>>
> >>>>    lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> >>>>    q =: >. -: -: # lits
> >>>>    cut_by =: q ,:~ "0 q * i. 4
> >>>>    quarter =: q {. lits
> >>>>
> >>>>    4 (6!:2) '/:~ lits'
> >>>> 1.37698
> >>>>    4 (6!:2) '/:~ quarter'
> >>>> 0.308181
> >>>>    4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> >>>> 0.414603
> >>>>
> >>>> Excellent, the times I see match nicely; I understand there's overhead with threads. Next, 10 million 8-byte integers:
> >>>>
> >>>>    ints =: ? (1 * 10^7) $ (10^18)
> >>>>    q =: >. -: -: # ints
> >>>>    cut_by =: q ,:~ "0 q * i. 4
> >>>>    quarter =: q {. ints
> >>>>
> >>>>    4 (6!:2) '/:~ ints'
> >>>> 0.561057
> >>>>    4 (6!:2) '/:~ quarter'
> >>>> 0.124735
> >>>>    4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> >>>> 0.441807
> >>>>
> >>>> And I don't like this third time. There's roughly the same amount of data for "thread overhead". My expectation for this result was approx. 150-200 ms or so.
> >>>>
> >>>> In fact, looking at 415 ms for literals and 442 ms for numbers irked me into temptation:
> >>>>
> >>>>    head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> >>>>    _8 ]\ a. i. head
> >>>> 226 0 0 0 0  0  0   0
> >>>>   0 0 0 0 0  0  0   4
> >>>>   0 0 0 0 0 38 37 160
> >>>>   0 0 0 0 0  0  0   1
> >>>>   0 0 0 0 0 38 37 160
> >>>>
> >>>>    to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> >>>>    from_bytes =: (3!:2) @: (head&,) @: ,
> >>>>
> >>>>    (from_bytes /:~ to_bytes quarter) -: (/:~ quarter) NB. sane still?
> >>>> 1
> >>>>
> >>>>    4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> >>>> 0.51177
> >>>>
> >>>> Ah, it didn't work. But perhaps it could with certain data, CPU model, number of cores? So my question is whether you could confirm that it (slower-than-expected speed with numerical sort in threads) is neither a J issue, nor 't.', nor '/:~'. Sorry if I wasted your time.
> >>>>
> >>>> Best regards,
> >>>> Vadim
> >>>>
> >>>> On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <[email protected]> wrote:
> >>>>> Fixed for the release. Thanks for the clear report. The problem was specific to the forms you mentioned. Workaround: use
> >>>>>
> >>>>>    < @: ((+/) @:])  instead of  < @: (+/) @:]
> >>>>>
> >>>>> The form <@:f is given special treatment. Your form was incorrectly being given that treatment.
> >>>>>
> >>>>> If t. in cut is not meeting your expectations, perhaps you should adjust your expectations. Verbs like (+/) will not benefit from threading in most cases, and may slow down considerably. +/;.0 might be even worse. Why? Because +/ is totally limited by the speed of reading from memory. If the data fits in level-2 data cache (D2$), many cores are no faster than one.
> >>>>>
> >>>>> In fact they are much slower, because only one core has the data in D2$. The rest have to transfer the data from the bottom of the ocean (i.e. from the core with the data, through D3$) or from the moon (SDRAM). They are spending their time waiting for responses from memory. +/;.0 creates a virtual block for each section and passes that to +/ . There is no need to move the data except for the reading required by +/ . If you run the +/ in a thread, the virtual block must be realized with an explicit copy from the bottom of the ocean. That doesn't add much, because once the copy is made the data will be in D2$ of the receiving core, but it is a small slowdown.
> >>>>>
> >>>>> A thread needs to be able to run in its own core until it has done reads+writes to D1$/D2$ at least, say, 100 times the size of its arguments+result. +/ . * is a perfect example. On large matrices the arguments are cycled through repeatedly.
> >>>>>
> >>>>> Henry Rich
> >>>>>
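[A rough, illustrative check of the point above that +/ is memory-bound, reusing the ints and cut_by definitions from Vadim's message. Per the explanation above, the threaded version is expected to be no faster than the single-threaded one, and may well be slower; no measured figures are claimed here.]

   ints =: ? (1 * 10^7) $ (10^18)
   q =: >. -: -: # ints
   cut_by =: q ,:~ "0 q * i. 4
   (6!:2) '+/ ints'                            NB. single thread, limited by memory reads
   (6!:2) '+/ ; cut_by (+/)t.(0$0);.0 ints'    NB. partial sums in workers, total in master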
> >>>>> On 1/25/2023 7:08 AM, vadim wrote:
> >>>>>> Hi, please consider this:
> >>>>>>
> >>>>>>    ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>    ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>    ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> >>>>>> +---+---+
> >>>>>> |0 1|2 3|
> >>>>>> +---+---+
> >>>>>>
> >>>>>> No issues in 8.07; and a bug (that's what I'd call it) in 9.03 and 9.04. It looks like it happens if the left arg has multiple ranges, and the verb to apply is composed with "same" and "box" verbs as first and last in the sequence. But that's at first glance only. Sometimes omitting parentheses would help (which clearly means a parsing issue?). All of these produce the expected output:
> >>>>>>
> >>>>>>    (2,:2) (< @: +: @: ]);.0 i. 4
> >>>>>> +---+
> >>>>>> |4 6|
> >>>>>> +---+
> >>>>>>    ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> >>>>>> 0 2
> >>>>>> 4 6
> >>>>>>    ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> >>>>>> +-+-+
> >>>>>> |1|5|
> >>>>>> +-+-+
> >>>>>>    (0,:2) (< @: (\:~) @: ]);.0 i. 4
> >>>>>> +---+
> >>>>>> |1 0|
> >>>>>> +---+
> >>>>>>    ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> >>>>>> 1 0
> >>>>>> 3 2
> >>>>>>    ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> >>>>>> +---+---+
> >>>>>> |1 0|3 2|
> >>>>>> +---+---+
> >>>>>>
> >>>>>> While "why would you want to use the ']' here?" would be a reasonable question, in the end syntax is either correct or not. In fact, I was testing all kinds of different constructs while investigating why a multi-threaded operation (with "t.") on subarrays is so much slower than expected, although it's perhaps totally unrelated to what's discovered above.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Vadim
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
