I'm surprised that the result is 512x512 when you use u;._3 on a
512x512 argument. Do you pad after the operation?
1. How good the operation is for threading depends on the ratio of
processing to reading arguments/writing results. The arguments start
out in a different core's cache, and have to be transferred over the
mesh to the core doing the processing. That takes dozens of cycles per
cacheline transferred; whether that's a big number or not depends on how
much work you have to do after the data arrives. I /THINK/ that each
core has an interface to the mesh that runs at about the speed of the L3
cache on average. If anybody knows details about this, I hunger for them.
2. A logical processor is not a core. Two logical processors share
cache/pipeline/execution units/memory interface, and only one can
execute at a time. Again, I can't find a good description of the
details, but my guess is that a logical-processor switch occurs only on
a pipeline break, i.e. a mispredicted branch. For sloppy C code with
lots of conditionals, enough cycles are lost to pipeline breaks that
it's worthwhile to have a hyperthread waiting to use them; but JE is
coded with especial care to minimize the number of mispredicted
branches. A single thread of JE will usually keep a core busy, I
reckon. We recommend creating one thread per /core/, not per /logical
processor/. Some applications can perhaps benefit from more threads
than cores, but it doesn't surprise me that yours doesn't.
3. u;._3 was lovingly coded to minimize data movement for image
processing. Consider a 3x3 convolution moving across a 5x5 argument. I
start by copying the first 5x3 section:
abc
fgh
klm
pqr
uvw
Using a virtual argument, I execute u on these three 3x3 cells (a-m,
f-r, k-w) without moving any data. After going all the way down the column,
I copy in the next column, overwriting the first column offset down one row:
bc
dgh
ilm
nqr
svw
x
By simply advancing the array pointer one column, without moving any
data, this becomes
bcd
ghi
lmn
qrs
vwx
and again I can use a virtual argument to process all the cells (b-n,
g-s, l-x) without moving any data.
The bottom line is that only a single copy of the input argument is made.
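In other terms, the column-sliding scheme can be sketched like this. This is a NumPy illustration of the idea for the same 5x5/3x3 case, not J's actual code; all the names (A, buf, cells, start) are invented, and only the single-copy behavior is taken from the description above.

```python
import numpy as np

A = np.arange(25).reshape(5, 5)          # stand-in for the 5x5 argument
n_rows, n_cols, win = 5, 5, 3

# flat buffer: one 5x3 panel, plus one extra slot per later column step
buf = np.empty(n_rows * win + (n_cols - win), dtype=A.dtype)
buf[:n_rows * win] = A[:, :win].ravel()  # copy the first 5x3 section
start = 0                                # the "array pointer"

cells = []
for col in range(n_cols - win + 1):
    # virtual 5x3 panel: a view into buf, no data movement
    panel = buf[start:start + n_rows * win].reshape(n_rows, win)
    for row in range(n_rows - win + 1):
        cells.append(panel[row:row + win].copy())   # one 3x3 cell
    if col + win < n_cols:
        # copy in the next input column, overwriting the old leading
        # column offset down one row, then advance the pointer
        for r in range(n_rows):
            buf[start + win + r * win] = A[r, col + win]
        start += 1

# cells matches a naive extraction of every 3x3 window, column by column
expected = [A[r:r + win, c:c + win]
            for c in range(n_cols - win + 1)
            for r in range(n_rows - win + 1)]
assert all((got == want).all() for got, want in zip(cells, expected))
```

Each column step copies in exactly one new column of the input, so the input is copied once overall, just as described.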
What I'm saying is that u;._3 is very cache-friendly, which might tend to
reduce the gain from multithreading.
It would be interesting to see how much better 7 worker threads are than 3.
Henry Rich
On 7/13/2023 1:26 PM, Clifford Reiter wrote:
Hi,
I thought I would experiment with t.
I chose an "image" processing problem on a 512 by 512 array.
Local (complex) processing occurs on 3 by 3 cells (u;._3), which results in
a 512 by 512 array. That process is iterated (^:_), here around 150 times.
So I thought this might be a good place to look for a speedup using t.
7 threads were created as per recommendation: {{0 T. 0}}^:] _1+{.8 T. ''
Times (sec, via 6!:2) are on the left below.
113 with no t.
41 with t. applied to 512 arrays of size 3 by 512 at each iteration
41 with t. applied to 7 nearly equal m by 512 blocks at each iteration
39 with t. applied to 14 nearly equal m by 512 blocks at each iteration
I'm not unhappy with an almost 3x gain, but I am wondering if this is a bad
problem for t.? Also, when not using t., task manager shows J using about
3.8%; with t., it shows J using about 21%, while other things are in the low
single digits. I am surprised that I can't peg the CPUs near 100%. (4
cores, 8 logical processors, Windows, J9.5 beta 4.)
Just sharing my experience and welcoming any comments.
Best, Cliff
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------