I'm surprised that the result is 512x512 when you use u;._3 on a 512x512 argument.  Do you pad after the operation?

1. How well an operation threads depends on the ratio of processing to reading arguments/writing results.  The arguments start out in a different core's cache and have to be transferred over the mesh to the core doing the processing.  That takes dozens of cycles per cacheline transferred; whether that's a big number or not depends on how much work you have to do after the data arrives.  I /THINK/ that each core's interface to the mesh runs at about the speed of the L3 cache on average.  If anybody knows details about this, I hunger for them.

2. A logical processor is not a core.  Two logical processors share cache/pipeline/execution units/memory interface, and only one can execute at a time.  Again I can't find a good description of the details, but my guess is that a logical-processor switch occurs only on a pipeline break, i.e. a mispredicted branch.  In sloppy C code with lots of conditionals, enough cycles are lost to pipeline breaks that it's worthwhile to have a hyperthread waiting to use them; but JE is coded with especial care to minimize the number of mispredicted branches.  A single thread of JE will usually keep a core busy, I reckon.  We recommend creating one thread per /core/, not per /logical processor/.  Some applications can perhaps benefit from more threads than cores, but it doesn't surprise me that yours doesn't.

3. u;._3 was lovingly coded to minimize data movement for image processing.  Consider a 3x3 convolution moving across a 5x5 argument.  I start by copying the first 5x3 section:

abc
fgh
klm
pqr
uvw

Using a virtual argument, I execute u on these 3 3x3 cells (a-m, f-r, k-w) without moving any data.  After going all the way down the column, I copy in the next column, overwriting the first column offset down one row:

 bc
dgh
ilm
nqr
svw
x

By simply advancing the array pointer one column (again without moving any data), this becomes

bcd
ghi
lmn
qrs
vwx

and again I can use a virtual argument to process all the cells (b-n, g-s, l-x) without moving any data.

The bottom line is that only a single copy of the input argument is made.

What I'm saying is that u;._3 is very cache-friendly, which might tend to reduce the gain from multithreading.
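
The overlapping tiling described above can be seen directly in J.  A minimal sketch (my illustration, not from the original message), using < as the u so the cells are visible; the x argument to ;._3 gives movement 1 1 and cell size 3 3:

```j
   ]y =. 5 5 $ 'abcdefghijklmnopqrstuvwxy'
abcde
fghij
klmno
pqrst
uvwxy
   NB. 3x3 cells moving by 1 row/column; ;._3 keeps only complete cells
   $ (1 1 ,: 3 3) <;._3 y
3 3
   > {. , (1 1 ,: 3 3) <;._3 y   NB. the first cell, a-m in the text above
abc
fgh
klm
```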

It would be interesting to see how much better 7 worker threads are than 3.
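
For anyone who wants to run that comparison, a sketch (assuming a verb step that performs one iteration of the u;._3 computation on an initial array y0 -- those names are mine -- and a fresh J session for each thread count, since worker threads persist once created):

```j
   {{0 T. 0}}^:3 ''           NB. create 3 worker threads in threadpool 0
   6!:2 'r =: step^:150 y0'   NB. time 150 iterations, as in the experiment
```

Then repeat in a new session with ^:7 for the 7-thread run.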

Henry Rich


On 7/13/2023 1:26 PM, Clifford Reiter wrote:
  Hi,
I thought I would experiment with t.
I chose an "image"-processing problem on a 512 by 512 array.
Local (complex) processing occurs on 3 by 3 cells (u;._3) which results in
a 512 by 512 array. That process is iterated (^:_), here around 150 times.
So I thought this might be a good place to look for a speedup using t.
7 threads were created as per recommendation: {{0 T. 0}}^:] _1+{.8 T. ''

time (sec, via 6!:2) on the left below.
113   with no t.
41    with t. applied to 512 arrays of size 3 by 512 at each iteration
41    with t. applied to 7 nearly equal m by 512 blocks at each iteration
39    with t. applied to 14 nearly equal m by 512 blocks at each iteration
I'm not unhappy with an almost 3x gain, but I am wondering if this is a bad
problem for t.?  Also, when not using t., Task Manager shows J using about
3.8%; with t., it shows J using about 21%, and other things are in the low
single digits. I am surprised that I can't peg the CPUs near 100%. (4
cores, 8 logical processors, Windows, J9.5 beta 4.)
Just sharing my experience and welcoming any comments.
Best, Cliff
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
