I'm surprised that the result is 512x512 when you use u;._3 on a
512x512 argument. Do you pad after the operation?
1. How good the operation is for threading depends on the ratio of
processing to reading arguments/writing results. The arguments start
out in a different core's cache, and have to be transferred over the
mesh to the core doing the processing. That takes dozens of cycles per
cacheline transferred; whether that's a big number or not depends on how
much work you have to do after the data arrives. I /THINK/ that each
core has an interface to the mesh that runs at about the speed of the L3
cache on average. If anybody knows details about this, I hunger for them.
2. A logical processor is not a core. Two logical processors share
cache/pipeline/execution units/memory interface, and only one can
execute at a time. Again, I can't find a good description of the
details, but my guess is that a logical-processor switch occurs only on
a pipeline break, i.e. a mispredicted branch. For sloppy C code with
lots of conditionals, enough cycles are lost to pipeline breaks that
it's worthwhile to have a hyperthread waiting to use them; but JE is
coded with especial care to minimize the number of mispredicted
branches. A single thread of JE will usually keep a core busy, I
reckon. We recommend creating one thread per /core/, not per /logical
processor/. Some applications can perhaps benefit from more threads
than cores, but it doesn't surprise me that yours doesn't.
3. u;._3 was lovingly coded to minimize data movement for image
processing. Consider a 3x3 convolution moving across a 5x5 argument. I
start by copying the first 5x3 section:
abc
fgh
klm
pqr
uvw
Using a virtual argument, I execute u on these three 3x3 cells (a-m,
f-r, k-w) without moving any data. After going all the way down the column,
I copy in the next column, overwriting the first column offset down one row:
bc
dgh
ilm
nqr
svw
x
By simply advancing the array pointer one column, without moving any
data, this becomes
bcd
ghi
lmn
qrs
vwx
and again I can use a virtual argument to process all the cells (b-n,
g-s, l-x) without moving any data.
The bottom line is that only a single copy of the input argument is made.
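In other terms, the column-sliding scheme can be sketched like this. This is a NumPy illustration of the idea for the same 5x5/3x3 case, not J's actual code; all the names (A, buf, cells, start) are invented, and only the single-copy behavior is taken from the description above.

```python
import numpy as np

A = np.arange(25).reshape(5, 5)          # stand-in for the 5x5 argument
n_rows, n_cols, win = 5, 5, 3

# flat buffer: one 5x3 panel, plus one extra slot per later column step
buf = np.empty(n_rows * win + (n_cols - win), dtype=A.dtype)
buf[:n_rows * win] = A[:, :win].ravel()  # copy the first 5x3 section
start = 0                                # the "array pointer"

cells = []
for col in range(n_cols - win + 1):
    # virtual 5x3 panel: a view into buf, no data movement
    panel = buf[start:start + n_rows * win].reshape(n_rows, win)
    for row in range(n_rows - win + 1):
        cells.append(panel[row:row + win].copy())   # one 3x3 cell
    if col + win < n_cols:
        # copy in the next input column, overwriting the old leading
        # column offset down one row, then advance the pointer
        for r in range(n_rows):
            buf[start + win + r * win] = A[r, col + win]
        start += 1

# cells matches a naive extraction of every 3x3 window, column by column
expected = [A[r:r + win, c:c + win]
            for c in range(n_cols - win + 1)
            for r in range(n_rows - win + 1)]
assert all((got == want).all() for got, want in zip(cells, expected))
```

Each column step copies in exactly one new column of the input, so the input is copied once overall, just as described.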
What I'm saying is that u;._3 is very cache-friendly, which might tend to
reduce the gain from multithreading.
It would be interesting to see how much better 7 worker threads are than 3.
Henry Rich
On 7/13/2023 1:26 PM, Clifford Reiter wrote:
Hi,
I thought I would experiment with t.
I chose an "image" processing problem on a 512 by 512 array.
Local (complex) processing occurs on 3 by 3 cells (u;._3), which results in
a 512 by 512 array. That process is iterated (^:_), here around 150 times.
So I thought this might be a good place to look for a speedup using t.
7 threads were created as per recommendation: {{0 T. 0}}^:] _1+{.8 T. ''
Times (sec, via 6!:2) are on the left below.
113 with no t.
41 with t. applied to 512 arrays of size 3 by 512 at each iteration
41 with t. applied to 7 nearly equal m by 512 blocks at each iteration
39 with t. applied to 14 nearly equal m by 512 blocks at each iteration
I'm not unhappy with an almost 3x gain, but I am wondering if this is a bad
problem for t.? Also, when not using t., task manager shows J using about
3.8%; with t., it shows J using about 21%, while other things are in the low
single digits. I am surprised that I can't peg the CPUs near 100%. (4
cores, 8 logical processors, Windows, J9.5 beta 4.)
Just sharing my experience and welcoming any comments.
Best, Cliff
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------