Actually both AVX and cache are important: cache ordering is necessary to read the input faster, then AVX instructions to process them. It keeps a single CPU pretty busy, so BLAS must be doing something special.

Henry Rich

On 4/21/2017 12:10 PM, bill lam wrote:
Improvement coming from avx is not that important, the major improvement in
inner product is the algorithm had been tuned to keep cache hot. Using sse2
can also achieve similar improvement factor when using the new codes.

blas is super optimized, it might have running in multiple threads, so that
on my  4 cores cpu, it runs 4x faster, still not sure from where the rest
5x improvement comes. offloading to gpu in i7?


On 21 Apr, 2017 11:37 pm, "Xiao-Yong Jin" <[email protected]> wrote:

Thanks.  It's interesting to see how far the avx has come along.
And the |: takes more than 20% of the time?  I guess this is something that
could be improved.

On Apr 21, 2017, at 7:42 AM, bill lam <[email protected]> wrote:

Opp, output from dgemm should be transposed to row major.

dgemm=: 'liblapack.so.3 dgemm_ > n *c *c *i *i *i *d *d *i *d *i *d *d
*i'&cd
mm=: 4 : 0
k=. ,{.$x
c=. (k,k)$1.5-1.5
dgemm (,'T');(,'T');k;k;k;(,2.5-1.5);x;k;y;k;(,1.5-1.5);c;k
|:c
)

'A B'=:0?@$~2,,~4096
echo timespacex'c1=: A+/ .*B'
echo timespacex'c2=: A mm B'
echo c1-:c2

NB. avx
   load'dgemm.ijs'
19.4683 2.68438e8
1.11488 5.36873e8
1

NB. j602
167.99789 2.684384e8
1.224063 5.369056e8
1

j806 version is already quite good.

Пт, 21 апр 2017, bill lam написал(а):
I tested with J calling lapack for matrix multiplication with the
following script,

NB. extern dgemm_(char * transa, char * transb, int * m, int * n, int *
k,
NB.               double * alpha, double * A, int * lda,
NB.               double * B, int * ldb, double * beta,
NB.               double * C, int * ldc);

dgemm=: 'liblapack.so.3 dgemm_ > n *c *c *i *i *i *d *d *i *d *i *d *d
*i'&cd
mm=: 4 : 0
k=. ,{.$x
c=. (k,k)$1.5-1.5
dgemm (,'T');(,'T');k;k;k;(,2.5-1.5);x;k;y;k;(,1.5-1.5);c;k
c
)

'A B'=:0?@$~2,,~4096
echo timespacex'A+/ .*B'
echo timespacex'A mm B'

result was,
19.3608 2.68437e8
0.886447 2.68442e8

Note it need to use an optimized version of blas, not the
reference blas.

Apparently the blas used in julia is sub-optimal.

Вт, 18 апр 2017, bill lam написал(а):
I think julia just calls blas.

Пн, 17 апр 2017, Xiao-Yong Jin написал(а):
On Apr 17, 2017, at 9:26 PM, Henry Rich <[email protected]> wrote:

If you have an implementation of +/ . * on double-precision floats
that's faster than J 8.06, I would be obliged if you'd send me a copy of
the source code.
I'm sure your code is much faster than naive c loops.  But some how
the matrix-matrix multiplication is much slower (10x) than that in julia
(tested with a 3-year old version).
% julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.1 (2014-02-11 06:30 UTC)
_/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-linux-gnu

julia> A=rand(4096,4096); B=rand(4096,4096);

julia> @time A*B;
elapsed time: 2.260157127 seconds (149184640 bytes allocated)

julia>
% jconsole
   JVERSION
Engine: j806/j64avx/linux
Beta-3: commercial/2017-04-10T17:51:14
Library: 8.06.02
Platform: Linux 64
Installer: J806 install
InstallPath: /nfs2/xjin/pkgs/j64-806
Contact: www.jsoftware.com
   'A B'=:0?@$~2,,~4096
   timespacex'A+/ .*B'
23.8976 2.68437e8
   timespacex'A+/ .*B'

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
--
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
--
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
--
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm


---
This email has been checked for viruses by AVG.
http://www.avg.com

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to