Whether the switchover is at 11x11 or 100x100 I'm not so sure. Anyway, there is
some arbitrary threshold.

IIUC 256-bit arithmetic only applies to x86_64 AVX. arm64 has NEON but
needs another implementation.

On Fri, May 17, 2019, 8:01 AM Henry Rich <[email protected]> wrote:

> The switchover to BLAS is at more like 100x100 IIRC; for smaller than
> that it uses a fast 1-core routine that I wrote.
>
> +/ . * is highly optimized, but for two different cases.  Say you are
> multiplying mxp times pxn to produce an mxn product.  If m, p, and n are
> big enough to allow the result to be calculated in 4x4 blocks, by
> careful management of caches the I/O can be reduced relative to the
> arithmetic, and the product can be produced as fast as the ALU can do
> the arithmetic.
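>
> For example, an illustrative shape check only (the names and shapes here are
> arbitrary, with random integer data):
>
>        A =. ? 50 60 $ 100      NB. m x p
>        B =. ? 60 40 $ 100      NB. p x n
>        $ A +/ . * B            NB. m x n
> 50 40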
>
> If n is 1, what you have is a series of dot-products.  These are
> produced with special code that uses multiple 256-bit accumulators (in
> the next beta; now there are multiple 64-bit accumulators) to produce
> each scalar result.  This code is directly invoked via +/@:*"1, but
> +/ . * switches over to it when it feels like it.
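>
> A quick check of that special case (arbitrary names and random integer data):
>
>        A =. ? 5 4 $ 100
>        v =. ? 4 $ 100
>        (A +/ . * v) -: A +/@:*"1 v
> 1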
>
> Other values of m, n, and p are not as efficient because working in 4x4
> blocks has edge wastage.  If the matrices are smaller than 11x11 or so, the
> system just evaluates it as +/@(*"1 _), as Ken defined it.
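>
> For example, a small verification of that identity (arbitrary random integer
> matrices):
>
>        A =. ? 7 5 $ 100
>        B =. ? 5 3 $ 100
>        (A +/ . * B) -: A +/@(*"1 _) B
> 1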
>
> If the matrices are really big the system calls BLAS, which uses similar
> techniques but can use multiple cores.
>
> Henry Rich
>
> On 5/16/2019 7:31 PM, bill lam wrote:
> > Ah I forgot, if the size of the matrix is small, e.g. less than 11x11, then it
> > won't call the blas routine.
> >
> > On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:
> >
> >> If you use j807 with an avx-capable cpu, it should call optimized blas for
> >> that pattern; you can compare with previous versions of J. If you want even
> >> faster, you can build j.dll/libj.so from source to enable multiple-core
> >> support, and performance will scale up with the number of cores used.
> >>
> >> On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <
> >> [email protected]> wrote:
> >>
> >>> (Also answers Bill’s post, just in)
> >>>
> >>> I think I misled you. Brian’s “dot” is more correctly the matrix product,
> >>> such as
> >>>       2 3 (+/ . *)&i. 3 4
> >>> 20 23 26 29
> >>> 56 68 80 92
> >>> so we’re talking about dot =: +/ . *
> >>>
> >>> In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B
> >>> for an mxk result,
> >>> A dot |: B
> >>> In others, he needs C, shape mxn, by D, shape mxk,  for an nxk result,
> >>> (|: C) dot D
> >>> and of course, some are straight matrix multiplications.
> >>>
> >>> I defined    Tdot =: |:@:[ +/ .* ] and dotT =: dot |:
> >>>
> >>> Are matrix multiplications going to be enhanced?  And what about  such
> >>> variants as these?
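> >>>
> >>> For example, a quick shape check (arbitrary shapes and random integer data,
> >>> with dot, Tdot and dotT as defined above):
> >>>
> >>>       A =. ? 4 3 $ 100 [ B =. ? 5 3 $ 100    NB. mxn and kxn
> >>>       $ A dotT B                             NB. mxk
> >>> 4 5
> >>>       C =. ? 4 3 $ 100 [ D =. ? 4 5 $ 100    NB. mxn and mxk
> >>>       $ C Tdot D                             NB. nxk
> >>> 3 5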
> >>>
> >>> Thanks,
> >>>
> >>> Mike
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:
> >>>>
> >>>> In the next beta +/@:*"1 uses 256-bit instructions, which should help
> >>>> with dot-products.
> >>>> Henry Rich
> >>>>
> >>>>> On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
> >>>>> I've tried various timings and tweaks - the dot products seem to consume
> >>>>> the most time; it's marginally worth dividing by "num_examples" after
> >>>>> summing "correct_logprobs" rather than summing the quotient,
> >>>>> " correct_logprobs%num_examples "
> >>>>>
> >>>>> I added a couple of dot fns,      Tdot =: |:@[ dot ]     and dotT =: dot |:
> >>>>> to neaten up the code a bit.  Those transposes seem unavoidable.
> >>>>>
> >>>>> In a practical application, you'd probably run cycles until either a
> >>>>> suitable level of convergence is achieved - or until it's obvious that
> >>>>> the process is divergent.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>>
> >>>>>> On 16/05/2019 15:20, Brian Schott wrote:
> >>>>>> Mike,
> >>>>>>
> >>>>>> Yes, I knew the reason that the calculation was done, but was surprised
> >>>>>> by the manner in which these authors applied the calculation (without
> >>>>>> the multiplication), and I applied the Amend incorrectly, by not
> >>>>>> remembering that it was being applied to an array.
> >>>>>>
> >>>>>> And you are correct that the Amend approach is slower and more
> >>>>>> space-consuming than the Product approach. I re-applied -- correctly,
> >>>>>> this time, finally🤞 -- the Amend approach on a 'dbstopped' version of
> >>>>>> `train` and got the following timings. In retrospect, both methods
> >>>>>> require the condition check, and then multiplying by 0 and 1 may be
> >>>>>> very fast relative to Amend's needs.
> >>>>>>
> >>>>>>        mnd =: 0:`(I.@(0&>:)@[)`]}"1
> >>>>>>        ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
> >>>>>> 1
> >>>>>>        10 timespacex'(hidden_layer>0)*dscores dot|:W2'
> >>>>>> 0.0004102 301568
> >>>>>>        10 timespacex'hidden_layer mnd dscores dot|:W2'
> >>>>>> 0.0006501 535360
> >>>>>>
> >>>>>> And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1  using a fork is very slightly
> >>>>>> faster than mnd.
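> >>>>>>
> >>>>>> For a tiny illustration of what mnd does (made-up numbers):
> >>>>>>
> >>>>>>        (2 2 $ 1 0 _3 4) mnd 2 2 $ 10 20 30 40
> >>>>>> 10  0
> >>>>>>  0 40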
> >>>>>>
> >>>>>>
> >>>>>> Thanks, again,
> >>>>>>
> >>>>>> On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> The Python authors' comments here explain (well, they assert) why
> >>>>>>> we're doing that filtering for hidden_layer > 0:
> >>>>>>>
> >>>>>>> " Now we have the gradient on the outputs of the hidden layer.
> Next,
> >>> we
> >>>>>>> have to backpropagate the ReLU non-linearity. This turns out to be
> >>> easy
> >>>>>>> because ReLU during the backward pass is effectively a switch.
> Since
> >>>>>>> r=max(0,x) , we have that dr/dx = 1 (x>0) . Combined with the chain
> >>>>>>> rule, we see that the ReLU unit lets the gradient pass through
> >>> unchanged
> >>>>>>> if its input was greater than 0, but kills it if its input was less
> >>> than
> >>>>>>> zero [or equal to zero - Mike's edit] during the forward pass."
> >>>>>>>
> >>>>>>> Isn't it curious that the J-way of doing it,
> >>>>>>>
> >>>>>>>       if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.   NB. find indices of elements <: 0
> >>>>>>>          dhidden =. 0 ilow } dhidden
> >>>>>>>       end.
> >>>>>>>
> >>>>>>> is much slower than the naive
> >>>>>>>
> >>>>>>>       dhidden =. (hidden_layer >0) * dscores dotT  W2
> >>>>>>> ?
> >>>>>>>
> >>>>>>> Mike
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>> (B=)
> >>>>>>