I think the fast 1-core code is non-existent for non-AVX CPUs, so it should only
switch over for AVX CPUs.

On Fri, May 17, 2019, 8:20 AM Henry Rich <[email protected]> wrote:

> The code sizes the problem by calculating m*p*n, and then:
>
> if m*p*n<=1000, it uses +/@:(*"1 _)
> if 1000<m*p*n<=5000000, it uses fast 1-core code
> if 5000000<m*p*n, it uses BLAS
>
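> For example (illustrative shapes, not taken from the actual source):
>
>    m10  =: 10 10 ?@$ 0     NB. 10x10 by 10x10:     m*p*n = 1000 -> +/@:(*"1 _)
>    m100 =: 100 100 ?@$ 0   NB. 100x100 by 100x100: m*p*n = 1e6  -> fast 1-core code
>    m200 =: 200 200 ?@$ 0   NB. 200x200 by 200x200: m*p*n = 8e6  -> BLAS
>    $ m100 +/ . * m100      NB. same verb in every case; only the internal path differs
> 100 100
>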
> Maybe some ARM user will write a Neon version.
>
> Henry Rich
>
>
> On 5/16/2019 8:12 PM, bill lam wrote:
> > Whether the switchover is at 11x11 or 100x100 I'm not so sure. Anyway, there is
> > some arbitrary threshold.
> >
> > IIUC 256-bit arithmetic only applies to x86_64 AVX. ARM64 has NEON but
> > needs another implementation.
> >
> > On Fri, May 17, 2019, 8:01 AM Henry Rich <[email protected]> wrote:
> >
> >> The switchover to BLAS is at more like 100x100 IIRC; for smaller than
> >> that it uses a fast 1-core routine that I wrote.
> >>
> >> +/ . * is highly optimized, but for two different cases.  Say you are
> >> multiplying mxp times pxn to produce an mxn product.  If m, p, and n are
> >> big enough to allow the result to be calculated in 4x4 blocks, by
> >> careful management of caches the I/O can be reduced relative to the
> >> arithmetic, and the product can be produced as fast as the ALU can do
> >> the arithmetic.
> >>
> >> If n is 1, what you have is a series of dot-products.  These are
> >> produced with special code that uses multiple 256-bit accumulators (in
> >> the next beta; now there are multiple 64-bit accumulators) to produce
> >> each scalar result.  This code is directly invoked via +/@:*"1, but
> >> +/ . * switches over to it when it feels like it.
> >>
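> >> For instance, with two plain float vectors the two phrasings give the same
> >> dot product (sizes made up):
> >>
> >>    x =: 100 ?@$ 0
> >>    y =: 100 ?@$ 0
> >>    (x +/ . * y) -: x +/@:*"1 y
> >> 1
> >>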
> >> Other values of m, n, and p are not as efficient because working in 4x4
> >> blocks has edge wastage.  If the matrices are less than 11x11 or so the
> >> system just evaluates as +/@:(*"1 _) like Ken defined it.
> >>
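> >> As a sketch of that expansion, with the outer rank written out explicitly
> >> (example shapes made up):
> >>
> >>    a =: 3 4 ?@$ 0
> >>    b =: 4 5 ?@$ 0
> >>    (a +/ . * b) -: a (+/@:(*"1 _))"1 _ b
> >> 1
> >>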
> >> If the matrices are really big the system calls BLAS, which uses similar
> >> techniques but can use multiple cores.
> >>
> >> Henry Rich
> >>
> >> On 5/16/2019 7:31 PM, bill lam wrote:
> >>> Ah, I forgot: if the size of the matrix is small, e.g. less than 11x11, then
> >>> it won't call the BLAS routine.
> >>>
> >>> On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:
> >>>
> >>>> If you use j807 with an AVX-capable CPU, it should call the optimized BLAS
> >>>> for that pattern; you can compare with previous versions of J. If you want it
> >>>> even faster, you can build j.dll/libj.so from source to enable multiple-core
> >>>> support, and performance will scale up with the number of cores used.
> >>>>
> >>>> On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> (Also answers Bill’s post, just in)
> >>>>>
> >>>>> I think I misled you. Brian’s “dot” is more correctly the matrix product,
> >>>>> such as
> >>>>>        2 3 (+/ . *)&i. 3 4
> >>>>> 20 23 26 29
> >>>>> 56 68 80 92
> >>>>> so we’re talking about dot =: +/ . *
> >>>>>
> >>>>> In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B
> >>>>> for an mxk result,
> >>>>> A dot |: B
> >>>>> In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,
> >>>>> (|: C) dot D
> >>>>> and of course, some are straight matrix multiplications.
> >>>>>
> >>>>> I defined    Tdot =: |:@:[ +/ .* ] and dotT =: dot |:
> >>>>>
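> >>>>> For instance, with made-up shapes (and dot =: +/ . * as above):
> >>>>>
> >>>>>        dot =: +/ . *
> >>>>>        Tdot =: |:@:[ dot ]
> >>>>>        dotT =: dot |:
> >>>>>        A =: 3 4 ?@$ 0
> >>>>>        B =: 5 4 ?@$ 0
> >>>>>        $ A dotT B      NB. (3x4) dot (4x5) -> 3x5, ie mxk
> >>>>> 3 5
> >>>>>        C =: 3 4 ?@$ 0
> >>>>>        D =: 3 5 ?@$ 0
> >>>>>        $ C Tdot D      NB. (4x3) dot (3x5) -> 4x5, ie nxk
> >>>>> 4 5
> >>>>>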
> >>>>> Are matrix multiplications going to be enhanced?  And what about such
> >>>>> variants as these?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>>> On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:
> >>>>>>
> >>>>>> In the next beta +/@:*"1 uses 256-bit instructions, which should help
> >>>>>> with dot-products.
> >>>>>> Henry Rich
> >>>>>>
> >>>>>>> On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
> >>>>>>> I've tried various timings and tweaks - the dot products seem to
> >>>>>>> consume the most time; it's marginally worth dividing by "num_examples"
> >>>>>>> after summing "correct_logprobs" rather than summing the quotient,
> >>>>>>> " correct_logprobs%num_examples "
> >>>>>>>
> >>>>>>> I added a couple of dot fns,      Tdot =: |:@[ dot ]     and dotT =: dot |:
> >>>>>>> to neaten up the code a bit.  Those transposes seem unavoidable.
> >>>>>>>
> >>>>>>> In a practical application, you'd probably run cycles until either a
> >>>>>>> suitable level of convergence
> >>>>>>> is achieved - or until it's obvious that the process is divergent.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Mike
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 16/05/2019 15:20, Brian Schott wrote:
> >>>>>>>> Mike,
> >>>>>>>>
> >>>>>>>> Yes, I knew the reason that the calculation was done, but was
> >>>>>>>> surprised by the manner in which these authors applied the calculation
> >>>>>>>> (without the multiplication), and I applied the Amend incorrectly, by not
> >>>>>>>> remembering that it was being applied to an array.
> >>>>>>>>
> >>>>>>>> And you are correct that the Amend approach is slower and more
> >>>>>>>> space-consuming than the Product approach. I re-applied -- correctly,
> >>>>>>>> this time, finally🤞 -- the Amend approach on a 'dbstopped' version of
> >>>>>>>> `train` and got the following timings. In retrospect, both methods require
> >>>>>>>> the condition check, and then multiplying by 0 and 1 may be very fast
> >>>>>>>> relative to Amend's needs.
> >>>>>>>>
> >>>>>>>>         mnd =: 0:`(I.@(0&>:)@[)`]}"1
> >>>>>>>>         ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
> >>>>>>>> 1
> >>>>>>>>         10 timespacex'(hidden_layer>0)*dscores dot|:W2'
> >>>>>>>> 0.0004102 301568
> >>>>>>>>         10 timespacex'hidden_layer mnd dscores dot|:W2'
> >>>>>>>> 0.0006501 535360
> >>>>>>>>
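> >>>>>>>> For a small made-up case, mnd zeroes the positions where the left
> >>>>>>>> argument is <: 0 :
> >>>>>>>>
> >>>>>>>>         _1 2 0 3 mnd 10 20 30 40
> >>>>>>>> 0 20 0 40
> >>>>>>>>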
> >>>>>>>> And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1  using a fork is very slightly
> >>>>>>>> faster than mnd.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks, again,
> >>>>>>>>
> >>>>>>>> On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> The Python authors' comments here explain (well, they assert) why we're
> >>>>>>>>> doing that filtering for hidden_layer > 0:
> >>>>>>>>>
> >>>>>>>>> " Now we have the gradient on the outputs of the hidden layer.
> >> Next,
> >>>>> we
> >>>>>>>>> have to backpropagate the ReLU non-linearity. This turns out to
> be
> >>>>> easy
> >>>>>>>>> because ReLU during the backward pass is effectively a switch.
> >> Since
> >>>>>>>>> r=max(0,x) , we have that dr/dx = 1 (x>0) . Combined with the
> chain
> >>>>>>>>> rule, we see that the ReLU unit lets the gradient pass through
> >>>>> unchanged
> >>>>>>>>> if its input was greater than 0, but kills it if its input was
> less
> >>>>> than
> >>>>>>>>> zero [or equal to zero - Mike's edit] during the forward pass."
> >>>>>>>>>
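> >>>>>>>>> A tiny sketch of that switch, with made-up numbers:
> >>>>>>>>>
> >>>>>>>>>        x =: _2 _1 0 1 2
> >>>>>>>>>        0 >. x           NB. ReLU forward pass, r = max(0,x)
> >>>>>>>>> 0 0 0 1 2
> >>>>>>>>>        x > 0            NB. backward-pass mask: gradient passes only where x>0
> >>>>>>>>> 0 0 0 1 1
> >>>>>>>>>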
> >>>>>>>>> Isn't it curious that the J-way of doing it,
> >>>>>>>>>
> >>>>>>>>>        if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do. NB. find indices of elements <: 0
> >>>>>>>>>           dhidden =. 0 ilow } dhidden
> >>>>>>>>>        end.
> >>>>>>>>>
> >>>>>>>>> is much slower than the naive
> >>>>>>>>>
> >>>>>>>>>        dhidden =. (hidden_layer >0) * dscores dotT  W2
> >>>>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> Mike
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>> (B=)
> >>>>>>>>