Sure, I understand. Let me quickly explain the matrix-vector behavior you've observed: I don't know your experimental setting, but for a 1k x 1k matrix-vector multiply, the small input (8MB) likely fits into the L3 cache. If you increase the data size to, say, 8GB (where you actually read from main memory in each run), you should see the difference.
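
To make the sizes above concrete, here is a small back-of-the-envelope sketch in Java. The class and method names are hypothetical and the 1M x 1k shape is just one assumed way to reach 8GB. It also assumes every matrix cell is an 8-byte double that is read from memory exactly once, which ignores cache reuse.

```java
// Hypothetical helper for rough estimates; not part of SystemML or Breeze.
final class FootprintSketch {

    // Bytes needed for a dense rows x cols matrix of doubles.
    static long denseBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    // Flops per byte read for y = A * x with an n x n matrix A:
    // 2*n^2 flops over the bytes of A plus the bytes of x.
    static double matVecIntensity(long n) {
        return (2.0 * n * n) / (denseBytes(n, n) + denseBytes(n, 1));
    }

    // Flops per byte for C = A * B with n x n matrices,
    // assuming A, B and C are each touched exactly once.
    static double matMultIntensity(long n) {
        return (2.0 * n * n * n) / (3.0 * denseBytes(n, n));
    }

    public static void main(String[] args) {
        System.out.println("1k x 1k bytes: " + denseBytes(1000, 1000));
        System.out.println("1M x 1k bytes: " + denseBytes(1_000_000, 1000));
        System.out.println("mat-vec flops/byte:  " + matVecIntensity(1000));
        System.out.println("mat-mult flops/byte: " + matMultIntensity(1000));
    }
}
```

Under these assumptions, matrix-vector stays below 0.25 flops per byte, so it is limited by how fast the data can be read, while matrix-matrix multiply has far more arithmetic per byte.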


On 12/1/2016 2:29 AM, wrote:
Hi Matthias,

thanks for the clarification as to why the current situation exists.

As I said: I didn't really run any serious benchmarks here. These are
simple comparisons that I ran to get a general feeling of where we are
with speed. The numbers for Breeze without native are definitely slower
than SystemML (about 3-4x) but that is not surprising and also not what
I wanted to look at. Breeze is known to be slow ;)

The problem I wanted to address here was actually in the context of DL
and so dense/dense was my major concern. It's clear that the
benefit/penalty of native operations heavily depends on other factors.
Just out of curiosity I tried Matrix/Vector multiply and SystemML is
actually 2x faster than native BLAS. But then this is 1ms vs. 2ms which
might even be within a standard deviation (didn't compute that though).

But anyways - I didn't want to argue for a major change here, I was
interested in a more systematic analysis of where we are compared to
other (a) low-level linear algebra libraries (b) DL frameworks. To do
this would definitely require setting up a more "scientific" benchmark
suite than my little test here.


On 01.12.2016 01:00, Matthias Boehm wrote:
ok, then let's sort this out one by one

1) Benchmarks: There are a couple of things we should be aware of for
these native/java benchmarks. First, please specify k as the number of
logical cores on your machine and use a sufficiently large heap with
Xms=Xmx and Xmn=0.1*Xmx. Second, exclude the initial warmup runs for
JIT compilation or outliers where GC happened from these measurements.

2) Breeze Comparison: Please also get the Breeze numbers without
native BLAS libraries as another baseline with comparable runtime.

3) Bigger Picture: Just to clarify the overall question here - of
course native BLAS libraries are expected to be faster for square (or
similar) dense matrix multiply, as current JDKs usually only compile
scalar but no packed SIMD instructions for these operations. How much
depends on the architecture. On older architectures with 128bit and
256bit vector units, it was not too problematic. But the trend
continues and hence it is worth thinking about it if nothing happens
on the JDK front. The reasons why we decided for platform independence
in the past were as follows:

(a) Square dense matrix multiply is not a common operation (other
than in DL). Much more common are memory-bandwidth bound matrix-vector
multiplications, and there copying your data out to a native library
actually leads to a 3x slowdown.
(b) In end-to-end algorithms, especially in large-scale scenarios, we
often see other factors dominating performance.
(c) Keeping the build and deployment simple without the dependency on
native libraries was the logical conclusion given (a) and (b).
(d) There are also workarounds: A user can always define an external
function and call whatever library she wants from there (we did this
in the past with certain LAPACK functions).
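
As a rough illustration of the measurement advice in 1), here is a minimal Java sketch of a timing loop that drops the warmup runs and reports the median next to mean and standard deviation, so that single GC outliers do not dominate. Everything in it is an assumption for illustration: the class name, the 10 warmup runs, the 4g heap in the comment, and the naive single-threaded multiply, which is only a stand-in workload and not SystemML's LibMatrixMult.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical harness. Example JVM flags following Xms=Xmx, Xmn=0.1*Xmx
// (the 4g heap size is an assumption):
//   java -Xms4g -Xmx4g -Xmn400m MatMultBenchSketch
final class MatMultBenchSketch {

    // Naive dense multiply (i-k-j loop order) on row-major n x n arrays.
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    // Drops the first `warmup` samples, which include JIT compilation.
    static double[] dropWarmup(double[] samples, int warmup) {
        int from = Math.min(warmup, samples.length);
        return Arrays.copyOfRange(samples, from, samples.length);
    }

    // Median of a non-empty sample; less sensitive to GC outliers than the mean.
    static double median(double[] samples) {
        double[] s = samples.clone();
        Arrays.sort(s);
        int mid = s.length / 2;
        return (s.length % 2 == 1) ? s[mid] : 0.5 * (s[mid - 1] + s[mid]);
    }

    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    // Sample standard deviation; needs at least two samples.
    static double stddev(double[] samples) {
        double m = mean(samples), sq = 0;
        for (double s : samples) sq += (s - m) * (s - m);
        return Math.sqrt(sq / (samples.length - 1));
    }

    public static void main(String[] args) {
        int n = 1000, runs = 50, warmup = 10;
        Random rnd = new Random(42);
        double[] a = rnd.doubles((long) n * n).toArray();
        double[] b = rnd.doubles((long) n * n).toArray();
        double[] millis = new double[runs];
        double checksum = 0; // keeps the JIT from discarding the result
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            double[] c = multiply(a, b, n);
            millis[r] = (System.nanoTime() - t0) / 1e6;
            checksum += c[r];
        }
        double[] steady = dropWarmup(millis, warmup);
        System.out.printf("median=%.2f ms, mean=%.2f ms, stddev=%.2f ms (checksum %.3f)%n",
                median(steady), mean(steady), stddev(steady), checksum);
    }
}
```

Reporting the standard deviation alongside the median also helps to judge whether small gaps, such as 1ms vs. 2ms, are meaningful at all.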


On 12/1/2016 12:27 AM, wrote:
This is the printout from 50 iterations with timings uncommented:

MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494

On 01.12.2016 00:08, Matthias Boehm wrote:
Could you please make sure you're comparing the right thing? Even on
old Sandy Bridge CPUs our matrix mult for 1kx1k usually takes 40-50ms.
We also did the same experiments with larger matrices and SystemML was
about 2x faster compared to Breeze. Please uncomment the timings in
LibMatrixMult.matrixMult and double check the timing as well as that
we're actually comparing dense matrix multiply.


On 11/30/2016 11:54 PM, wrote:
Hi all,

I have run a very quick comparison between SystemML's LibMatrixMult and
Breeze matrix multiplication using native BLAS (OpenBLAS through
netlib-java). As per my very small comparison I get the result that
there is a performance difference for dense-dense matrices of size
1000 x 1000 (our default blocksize) with Breeze being about 5-6 times
faster here. The code I used can be found here:

Running this code with 50 iterations each gives me for example average
times of:
Breeze:     49.74 ms
SystemML:  363.44 ms

I don't want to say this is true for every operation, but those numbers
let us form the hypothesis that native BLAS operations can lead to a
significant speedup for certain operations which is worth testing with
more advanced benchmarks.

Btw: I am definitely not saying we should use Breeze here. I am more
looking at native BLAS and LAPACK implementations in general (as
provided by OpenBLAS, MKL, etc.).

Let me know what you think!
