The array is periodically extended, so the result of u;._3 is 512 by 512. Using 3 threads instead of 7, and applying t. to 512 pieces of size 3 by 512, takes 59 sec, with J using about 10% of total CPU. I'll try slicing into 512 by 3 and 512 by m pieces later. Thanks for the comments, especially the insights into u;._3 !
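A minimal J sketch of the kind of row-block slicing described above, run under t. (the verb name `step`, the definition shape, and the block height are stand-ins, not code from this thread; the overlap rows that the 3 by 3 cells need at block boundaries are ignored for simplicity):

```j
NB. Sketch only: 'step' stands for the verb applied to each piece,
NB. and boundary/overlap rows for the 3x3 cells are omitted.
parstep =: {{
  blocks =. (-x) <\ y        NB. non-overlapping row-blocks of height x
  ; (step t.'')&.> blocks    NB. one task per block; raze forces the pyxes
}}
NB. use:  73 parstep img     NB. ~7 blocks of a 512-row image
```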
On Thu, Jul 13, 2023 at 2:09 PM Henry Rich <henryhr...@gmail.com> wrote:

> I'm surprised that the result is 512x512 when you use u;._3 on a
> 512x512 argument. Do you pad after the operation?
>
> 1. How good the operation is for threading depends on the ratio of
> processing to reading arguments/writing results. The arguments start
> out in a different core's cache and have to be transferred over the
> mesh to the core doing the processing. That takes dozens of cycles per
> cacheline transferred; whether that's a big number or not depends on
> how much work you have to do after the data arrives. I /THINK/ that
> each core has an interface to the mesh that runs at about the speed of
> the L3 cache on average. If anybody knows details about this, I hunger
> for them.
>
> 2. A logical processor is not a core. Two logical processors share
> cache/pipeline/execution units/memory interface, and only one can
> execute at a time. Again I can't find a good description of the
> details, but my guess is that a logical-processor switch occurs only
> on a pipeline break, i.e. a mispredicted branch. For sloppy C code
> with lots of conditionals, enough cycles are lost to pipeline breaks
> that it's worthwhile to have a hyperthread waiting to use them; but JE
> is coded with especial care to minimize the number of mispredicted
> branches. A single thread of JE will usually keep a core busy, I
> reckon. We recommend creating one thread per /core/, not per /logical
> processor/. Some applications can perhaps benefit from more threads
> than cores, but it doesn't surprise me that yours doesn't.
>
> 3. u;._3 was lovingly coded to minimize data movement for image
> processing. Consider a 3x3 convolution moving across a 5x5 argument.
> I start by copying the first 5x3 section:
>
>   abc
>   fgh
>   klm
>   pqr
>   uvw
>
> Using a virtual argument, I execute u on these 3 3x3 cells (a-m, f-r,
> k-w) without moving any data.
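The cell layout Henry describes can be checked directly by boxing the cells with <;._3 on his 5x5 example (an illustration of the result cells, not of JE's internal copying):

```j
   ] a =. 5 5 $ 'abcdefghijklmnopqrstuvwxy'
abcde
fghij
klmno
pqrst
uvwxy
   $ cells =. (1 1 ,: 3 3) <;._3 a   NB. movement 1 1 ,: cell shape 3 3
3 3
   > {."1 cells    NB. first column of cells: a-m, f-r, k-w
abc
fgh
klm

fgh
klm
pqr

klm
pqr
uvw
```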
> After going all the way down the column, I copy in the next column,
> overwriting the first column offset down one row:
>
>    bc
>   dgh
>   ilm
>   nqr
>   svw
>   x
>
> By simply advancing the array pointer one column, without moving any
> data, this is
>
>   bcd
>   ghi
>   lmn
>   qrs
>   vwx
>
> and again I can use a virtual argument to process all the cells (b-n,
> g-s, l-x) without moving any data.
>
> The bottom line is that only a single copy of the input argument is
> made.
>
> What I'm saying is that u;._3 is very cache-friendly, which might tend
> to reduce the gain from multithreading.
>
> It would be interesting to see how much better 7 worker threads are
> than 3.
>
> Henry Rich
>
>
> On 7/13/2023 1:26 PM, Clifford Reiter wrote:
> > Hi,
> > I thought I would experiment with t. I chose an "image" processing
> > problem on a 512 by 512 array. Local (complex) processing occurs on
> > 3 by 3 cells (u;._3), which results in a 512 by 512 array. That
> > process is iterated (^:_), here around 150 times. So I thought this
> > might be a good place to look for a speedup using t.
> > 7 threads were created as per the recommendation:
> >   {{0 T. 0}}^:] _1+{.8 T. ''
> >
> > time (sec, via 6!:2) on the left below:
> > 113 with no t.
> >  41 with t. applied to 512 arrays of size 3 by 512 at each iteration
> >  41 with t. applied to 7 nearly equal m by 512 blocks at each iteration
> >  39 with t. applied to 14 nearly equal m by 512 blocks at each iteration
> > I'm not unhappy with an almost 3x gain, but I am wondering if this
> > is a bad problem for t. ? Also, when not using t., task manager
> > shows J using about 3.8%; with t., it shows J using about 21%, and
> > other things are in the low single digits. I am surprised that I
> > can't peg the CPUs near 100%. (4 cores, 8 logical processors,
> > Windows, J9.5 beta 4.)
> > Just sharing my experience and welcoming any comments.
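A sketch of the measurement setup Cliff's message describes (the names `img`, `iter`, and `iterT` are placeholders for the image and for one unthreaded/threaded pass; only the thread-creation line is taken from the thread itself):

```j
NB. Sketch only: 'img', 'iter', and 'iterT' are placeholders.
{{0 T. 0}}^:] _1 + {. 8 T. ''   NB. create worker threads, as recommended
6!:2 'iter^:150 img'            NB. seconds for ~150 unthreaded passes
6!:2 'iterT^:150 img'           NB. seconds for ~150 threaded passes
```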
> > Best,
> > Cliff
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm