Hi Vadim,

What result do you get when you use this syntax?
   ; cut_by /:~t.(<'worker';1);.0 ints

Ak.

On Thu., Jan. 26, 2023, 08:46 Henry Rich <henryhr...@gmail.com> wrote:

> 1. Make sure you turn off all browsers etc. when you time multithreaded
> code.
>
> 2. DO NOT give an x argument to 6!:2. Repeated execution of the same
> sentence may tend to migrate values into the core they will be used in,
> which will underestimate the time required for a single run. If you
> have to time anything complicated, you may have to put in your own
> timing points.
>
> 3. 'Thread overhead' is more than just data movement. It includes the
> time required to lock variable-name lookups, and contention for D3$.
> The sort of 8-byte literals uses a merge sort IIRC, while the sort of
> integers uses a lovingly-coded quicksort. The quicksort is faster, as
> you can see. Mergesort is very cache-friendly: it reads and writes
> memory sequentially. Quicksort hops around, storing single words into
> two different streams. On a single thread that doesn't matter much,
> but with multiple threads and arguments far larger than D3$, the
> writing spills to DRAM: mergesort needs only one open DRAM page per
> thread while quicksort needs two per thread. That will make a
> difference; how much, I don't know.
>
> 4. If you want to split sorting into tasks, split it so that each task
> fits into the D2$ of a core. With 10M numbers (80MB) and a machine
> with 1MB D2$, let each task have 125K numbers. Experiment with smaller
> blocks. When you have the right size, the thread overhead will be
> comparable to the time spent making two copies of the whole array.
>
> 5. If you find that you can reduce the overhead that far, we could
> have a faster way to sort in J using multithreads: split the array
> into as many pieces as needed and sort them efficiently in threads,
> then transfer the data back to the master thread, which merges the
> ordered strings.
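The split-sort-merge scheme of point 5 above is language-neutral, so a minimal single-threaded sketch in Python may be useful for the archive; the chunk size and data here are purely illustrative, not tuned to any particular D2$:

```python
import heapq
import random

def chunked_sort(xs, chunk):
    """Sort xs by sorting fixed-size runs independently (in the real
    scheme, one run per worker thread), then k-way merging the sorted
    runs back together in the master."""
    runs = [sorted(xs[i:i + chunk]) for i in range((0), len(xs), chunk)]
    # heapq.merge is the reheaping merge of already-ordered runs;
    # this is the step whose cost would hide under the transfer time
    return list(heapq.merge(*runs))

random.seed(0)
data = [random.randrange(10**18) for _ in range(100_000)]
assert chunked_sort(data, 12_500) == sorted(data)
```

The merge touches each element once, sequentially, which is why its cost can overlap with moving the sorted runs back to the master thread.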
> The transfer back to the master thread is unavoidable, but the time
> required to merge (a reheaping operation) would be completely hidden
> under the transfer time.
>
> 6. Currently, starting a thread realizes any virtual argument in the
> master thread. It would be better to realize it in the worker thread
> where the data is needed.
>
> Henry Rich
>
> On 1/25/2023 6:11 PM, vadim wrote:
> > Thank you for fixing the issue and for the explanation. I would
> > appreciate it if you could look into the performance issue I'm
> > observing with my multi-threaded example (reduced to uselessness for
> > the sake of presentation, though). I will of course accept "that's
> > how things are with CPUs/memory" and adjust my expectations
> > accordingly, and I understand the results depend very much on the
> > hardware and dataset. There's too small a chance it's a J issue.
> >
> > The question concerns sorting data (I'm not doing the '+/' in
> > threads :)). I have a 4-core CPU, so I start 3 additional threads:
> >
> >    {{ for. i. 3 do. 0 T. 0 end. }} ''
> >
> > Suppose I have literal data, 10 million rows of 8 bytes each:
> >
> >    lits =: a. {~ ? ((1 * 10^7), 8) $ 256
> >    q =: >. -: -: # lits
> >    cut_by =: q ,:~ "0 q * i. 4
> >    quarter =: q {. lits
> >
> >    4 (6!:2) '/:~ lits'
> > 1.37698
> >    4 (6!:2) '/:~ quarter'
> > 0.308181
> >    4 (6!:2) '; cut_by /:~t.(0$0);.0 lits'
> > 0.414603
> >
> > Excellent, the times I see match up nicely; I understand there's
> > overhead with threads. Next, 10 million 8-byte integers:
> >
> >    ints =: ? (1 * 10^7) $ (10^18)
> >    q =: >. -: -: # ints
> >    cut_by =: q ,:~ "0 q * i. 4
> >    quarter =: q {. ints
> >
> >    4 (6!:2) '/:~ ints'
> > 0.561057
> >    4 (6!:2) '/:~ quarter'
> > 0.124735
> >    4 (6!:2) '; cut_by /:~t.(0$0);.0 ints'
> > 0.441807
> >
> > And I don't like this third time. There's roughly the same amount of
> > data for "thread overhead". My expectation for this result was
> > approx. 150-200 ms.
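For readers less familiar with J's cut conjunction: vadim's q and cut_by above compute four (start, length) intervals covering the array, which ;.0 then applies. The same arithmetic, sketched in Python (the names quarter_intervals and k are mine, for illustration only):

```python
import math

def quarter_intervals(n, k=4):
    """Rough Python analogue of  q =: >. -: -: # ints  and
    cut_by =: q ,:~ "0 q * i. 4 :  k (start, length) pairs."""
    q = math.ceil(n / k)            # >. -: -: n  when k = 4
    return [(i * q, q) for i in range(k)]

ivals = quarter_intervals(10_000_000)
assert ivals == [(0, 2_500_000), (2_500_000, 2_500_000),
                 (5_000_000, 2_500_000), (7_500_000, 2_500_000)]

# Concatenating the interval slices recovers the whole array
# (Python slices clip at the edge, as the final interval may need):
xs = list(range(10))
assert [x for s, l in quarter_intervals(10) for x in xs[s:s + l]] == xs
```

With n = 10^7 the quarters divide evenly, so every worker gets exactly 2.5M items.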
> > In fact, looking at 415 ms for literals and 442 ms for numbers irked
> > me into temptation:
> >
> >    head =. (3&(3!:4) 16be2), ,|."1 (3&(3!:4)"0) 4,q,1,q
> >    _8 ]\ a. i. head
> > 226 0 0 0 0  0  0   0
> >   0 0 0 0 0  0  0   4
> >   0 0 0 0 0 38 37 160
> >   0 0 0 0 0  0  0   1
> >   0 0 0 0 0 38 37 160
> >
> >    to_bytes =: 5&}. @: (_8&(]\)) @: (2&(3!:1))
> >    from_bytes =: (3!:2) @: (head&,) @: ,
> >
> >    (from_bytes /:~ to_bytes quarter) -: (/:~ quarter)  NB. sane still?
> > 1
> >
> >    4 (6!:2) ';cut_by (from_bytes @: /:~ @: to_bytes)t.(0$0);.0 ints'
> > 0.51177
> >
> > Ah, it didn't work. But perhaps it could with certain data, a
> > certain CPU model, or a certain number of cores? So my question is
> > whether you could confirm that this slower-than-expected numerical
> > sort in threads is neither a J issue, nor 't.', nor '/:~'. Sorry if
> > I wasted your time.
> >
> > Best regards,
> > Vadim
> >
> > On Wed, Jan 25, 2023 at 8:02 PM Henry Rich <henryhr...@gmail.com> wrote:
> >> Fixed for the release. Thanks for the clear report. The problem was
> >> specific to the forms you mentioned. Workaround: use
> >>
> >>    < @: ((+/) @:])   instead of   < @: (+/) @:]
> >>
> >> The form <@:f is given special treatment. Your form was incorrectly
> >> being given that treatment.
> >>
> >> If t. in cut is not meeting your expectations, perhaps you should
> >> adjust your expectations. Verbs like (+/) will not benefit from
> >> threading in most cases, and may slow down considerably. +/;.0
> >> might be even worse.
> >>
> >> Why? Because +/ is totally limited by the speed of reading from
> >> memory. If the data fits in level-2 data cache (D2$), many cores
> >> are no faster than one.
> >>
> >> In fact they are much slower, because only one core has the data in
> >> D2$. The rest have to transfer the data from the bottom of the
> >> ocean (i.e. from the core with the data, through D3$) or from the
> >> moon (SDRAM).
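The to_bytes/from_bytes experiment above rests on reinterpreting 8-byte integers as literal rows so that the literal (mergesort) path is taken. A related invariant, that fixed-width big-endian byte images of non-negative integers sort lexicographically in the same order as the numbers themselves, can be checked outside J. This Python sketch uses struct and is only an illustration of that invariant, not a reimplementation of the J verbs above:

```python
import random
import struct

def to_bytes(n):
    # 8-byte big-endian image: for unsigned 64-bit integers,
    # lexicographic order of the bytes matches numeric order
    return struct.pack('>Q', n)

random.seed(1)
ints = [random.randrange(10**18) for _ in range(10_000)]

# Sort the byte images, convert back, compare with the numeric sort
by_bytes = [struct.unpack('>Q', b)[0] for b in sorted(map(to_bytes, ints))]
assert by_bytes == sorted(ints)
```

Note the big-endian format is what makes byte order agree with numeric order; a little-endian image (e.g. 1 vs 256) would sort differently.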
> >> They are spending their time waiting for responses from memory.
> >>
> >> +/;.0 creates a virtual block for each section and passes that to
> >> +/ . There is no need to move the data except for the reading
> >> required by +/ . If you run the +/ in a thread, the virtual block
> >> must be realized with an explicit copy from the bottom of the
> >> ocean. That doesn't add much, because once the copy is made the
> >> data will be in the D2$ of the receiving core, but it is a small
> >> slowdown.
> >>
> >> A thread needs to be able to run in its own core until it has done
> >> reads+writes to D1$/D2$ of at least, say, 100 times the size of its
> >> arguments+result. +/ . * is a perfect example: on large matrices
> >> the arguments are cycled through repeatedly.
> >>
> >> Henry Rich
> >>
> >> On 1/25/2023 7:08 AM, vadim wrote:
> >>> Hi, please consider this:
> >>>
> >>>    ((0,:2),:(2,:2)) (< @: +: @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>    ((0,:2),:(2,:2)) (< @: (+/) @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>    ((0,:2),:(2,:2)) (< @: (\:~) @: ]);.0 i. 4
> >>> +---+---+
> >>> |0 1|2 3|
> >>> +---+---+
> >>>
> >>> No issues in 8.07, and a bug (that's what I'd call it) in 9.03 and
> >>> 9.04. It looks like it happens if the left argument has multiple
> >>> ranges and the verb to apply is composed with the "same" and "box"
> >>> verbs first and last in the sequence, but that's at first glance
> >>> only. Sometimes omitting parentheses helps (which clearly suggests
> >>> a parsing issue?). All of these produce the expected output:
> >>>
> >>>    (2,:2) (< @: +: @: ]);.0 i. 4
> >>> +---+
> >>> |4 6|
> >>> +---+
> >>>    ((0,:2),:(2,:2)) (] @: +: @: ]);.0 i. 4
> >>> 0 2
> >>> 4 6
> >>>    ((0,:2),:(2,:2)) (< @: +/ @: ]);.0 i. 4
> >>> +-+-+
> >>> |1|5|
> >>> +-+-+
> >>>    (0,:2) (< @: (\:~) @: ]);.0 i. 4
> >>> +---+
> >>> |1 0|
> >>> +---+
> >>>    ((0,:2),:(2,:2)) (] @: (\:~) @: ]);.0 i. 4
> >>> 1 0
> >>> 3 2
> >>>    ((0,:2),:(2,:2)) (< @: (\:~));.0 i. 4
> >>> +---+---+
> >>> |1 0|3 2|
> >>> +---+---+
> >>>
> >>> While "why would you want to use the ']' here?" is a fair
> >>> question, in the end syntax is either correct or not. In fact, I
> >>> was testing all kinds of different constructs while investigating
> >>> why a multi-threaded operation (with "t.") on subarrays is so much
> >>> slower than expected, although that is perhaps totally unrelated
> >>> to what's discovered above.
> >>>
> >>> Best regards,
> >>> Vadim
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm