The array is periodically extended, so the result of u;._3 is 512 by 512. Using 3 threads instead of 7, and applying t. to 512 pieces of size 3 by 512, takes 59 sec, with J using about 10% of total CPU. I'll try slicing into 512 by 3 and 512 by m pieces later. Thanks for the comments, especially the insights into u;._3 !
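A minimal J sketch of the kind of row-block slicing described above, run under t. (the verb name `step`, the definition shape, and the block height are stand-ins, not code from this thread; the overlap rows that the 3 by 3 cells need at block boundaries are ignored for simplicity):

```j
NB. Sketch only: 'step' stands for the verb applied to each piece,
NB. and boundary/overlap rows for the 3x3 cells are omitted.
parstep =: {{
  blocks =. (-x) <\ y        NB. non-overlapping row-blocks of height x
  ; (step t.'')&.> blocks    NB. one task per block; raze forces the pyxes
}}
NB. use:  73 parstep img     NB. ~7 blocks of a 512-row image
```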
On Thu, Jul 13, 2023 at 2:09 PM Henry Rich <henryhr...@gmail.com> wrote:

> I'm surprised that the result is 512x512 when you use u;._3 on a
> 512x512 argument. Do you pad after the operation?
>
> 1. How good the operation is for threading depends on the ratio of
> processing to reading arguments/writing results. The arguments start
> out in a different core's cache and have to be transferred over the
> mesh to the core doing the processing. That takes dozens of cycles per
> cacheline transferred; whether that's a big number or not depends on
> how much work you have to do after the data arrives. I /THINK/ that
> each core has an interface to the mesh that runs at about the speed of
> the L3 cache on average. If anybody knows details about this, I hunger
> for them.
>
> 2. A logical processor is not a core. Two logical processors share
> cache/pipeline/execution units/memory interface, and only one can
> execute at a time. Again I can't find a good description of the
> details, but my guess is that a logical-processor switch occurs only
> on a pipeline break, i.e. a mispredicted branch. For sloppy C code
> with lots of conditionals, enough cycles are lost to pipeline breaks
> that it's worthwhile to have a hyperthread waiting to use them; but JE
> is coded with especial care to minimize the number of mispredicted
> branches. A single thread of JE will usually keep a core busy, I
> reckon. We recommend creating one thread per /core/, not per /logical
> processor/. Some applications can perhaps benefit from more threads
> than cores, but it doesn't surprise me that yours doesn't.
>
> 3. u;._3 was lovingly coded to minimize data movement for image
> processing. Consider a 3x3 convolution moving across a 5x5 argument.
> I start by copying the first 5x3 section:
>
>   abc
>   fgh
>   klm
>   pqr
>   uvw
>
> Using a virtual argument, I execute u on these 3 3x3 cells (a-m, f-r,
> k-w) without moving any data.
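The cell layout Henry describes can be checked directly by boxing the cells with <;._3 on his 5x5 example (an illustration of the result cells, not of JE's internal copying):

```j
   ] a =. 5 5 $ 'abcdefghijklmnopqrstuvwxy'
abcde
fghij
klmno
pqrst
uvwxy
   $ cells =. (1 1 ,: 3 3) <;._3 a   NB. movement 1 1 ,: cell shape 3 3
3 3
   > {."1 cells    NB. first column of cells: a-m, f-r, k-w
abc
fgh
klm

fgh
klm
pqr

klm
pqr
uvw
```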
> After going all the way down the column, I copy in the next column,
> overwriting the first column offset down one row:
>
>    bc
>   dgh
>   ilm
>   nqr
>   svw
>   x
>
> By simply advancing the array pointer one column, without moving any
> data, this is
>
>   bcd
>   ghi
>   lmn
>   qrs
>   vwx
>
> and again I can use a virtual argument to process all the cells (b-n,
> g-s, l-x) without moving any data.
>
> The bottom line is that only a single copy of the input argument is
> made.
>
> What I'm saying is that u;._3 is very cache-friendly, which might tend
> to reduce the gain from multithreading.
>
> It would be interesting to see how much better 7 worker threads are
> than 3.
>
> Henry Rich
>
>
> On 7/13/2023 1:26 PM, Clifford Reiter wrote:
> > Hi,
> > I thought I would experiment with t. I chose an "image" processing
> > problem on a 512 by 512 array. Local (complex) processing occurs on
> > 3 by 3 cells (u;._3), which results in a 512 by 512 array. That
> > process is iterated (^:_), here around 150 times. So I thought this
> > might be a good place to look for a speedup using t.
> > 7 threads were created as per the recommendation:
> >   {{0 T. 0}}^:] _1+{.8 T. ''
> >
> > time (sec, via 6!:2) on the left below:
> > 113 with no t.
> >  41 with t. applied to 512 arrays of size 3 by 512 at each iteration
> >  41 with t. applied to 7 nearly equal m by 512 blocks at each iteration
> >  39 with t. applied to 14 nearly equal m by 512 blocks at each iteration
> > I'm not unhappy with an almost 3x gain, but I am wondering if this
> > is a bad problem for t. ? Also, when not using t., task manager
> > shows J using about 3.8%; with t., it shows J using about 21%, and
> > other things are in the low single digits. I am surprised that I
> > can't peg the CPUs near 100%. (4 cores, 8 logical processors,
> > Windows, J9.5 beta 4.)
> > Just sharing my experience and welcoming any comments.
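A sketch of the measurement setup Cliff's message describes (the names `img`, `iter`, and `iterT` are placeholders for the image and for one unthreaded/threaded pass; only the thread-creation line is taken from the thread itself):

```j
NB. Sketch only: 'img', 'iter', and 'iterT' are placeholders.
{{0 T. 0}}^:] _1 + {. 8 T. ''   NB. create worker threads, as recommended
6!:2 'iter^:150 img'            NB. seconds for ~150 unthreaded passes
6!:2 'iterT^:150 img'           NB. seconds for ~150 threaded passes
```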
> > Best,
> > Cliff
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm