Sure, I understand. Let me just quickly explain the matrix-vector behavior you've observed: I don't know your experimental setting, but for a 1k x 1k matrix-vector multiply the small input (8 MB) likely fits into the L3 cache. If you increase the data size to, say, 8 GB (where you actually read from main memory in each run), you should see the difference.

Regards, Matthias
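As a rough check of the sizes mentioned above, the footprint of a dense double-precision matrix is simply rows x cols x 8 bytes. The sketch below is only illustrative arithmetic: the class name `WorkingSet`, the helper names, and the 20 MiB L3 size are assumptions made for the example, not values taken from this thread or from SystemML.

```java
// Back-of-the-envelope footprint of a dense double-precision matrix.
// All names here are hypothetical; nothing in this class is SystemML API.
final class WorkingSet {

    /** Bytes needed for rows x cols doubles (8 bytes each), ignoring JVM object headers. */
    static long denseBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), (long) Double.BYTES);
    }

    /** True if the matrix could stay cache-resident for the given (assumed) cache size. */
    static boolean fitsInCache(long rows, long cols, long cacheBytes) {
        return denseBytes(rows, cols) <= cacheBytes;
    }

    public static void main(String[] args) {
        // Hypothetical 20 MiB last-level cache; look up the real value for your CPU.
        long assumedL3 = 20L * 1024 * 1024;

        System.out.println(denseBytes(1_000, 1_000));      // 1k x 1k matrix
        System.out.println(denseBytes(1_000_000, 1_000));  // 1M x 1k matrix
        System.out.println(fitsInCache(1_000, 1_000, assumedL3));
        System.out.println(fitsInCache(1_000_000, 1_000, assumedL3));
    }
}
```

With these numbers the 1k x 1k input is about 8 MB, while a 1M x 1k input is about 8 GB and has to be streamed from main memory on every run, whatever the exact cache size is.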

On 12/1/2016 2:29 AM, fschue...@posteo.de wrote:

Hi Matthias,

thanks for the clarification as to why the current situation exists. As I said, I didn't really run any serious benchmarks here. These are simple comparisons that I ran to get a general feeling of where we are with speed. The numbers for Breeze without native BLAS are definitely slower than SystemML (about 3-4x), but that is not surprising and also not what I wanted to look at. Breeze is known to be slow ;)

The problem I wanted to address here was actually in the context of DL, so dense/dense was my major concern. It's clear that the benefit or penalty of native operations heavily depends on other factors. Just out of curiosity I tried matrix-vector multiply, and SystemML is actually 2x faster than native BLAS. But then this is 1 ms vs. 2 ms, which might even be within a standard deviation (I didn't compute that, though).

But anyways - I didn't want to argue for a major change here. I was interested in a more systematic analysis of where we are compared to other (a) low-level linear algebra libraries and (b) DL frameworks. Doing this would definitely require setting up a more "scientific" benchmark suite than my little test here.

Felix

On 01.12.2016 01:00, Matthias Boehm wrote:

OK, then let's sort this out one by one.

1) Benchmarks: There are a couple of things we should be aware of for these native/Java benchmarks. First, please specify k as the number of logical cores on your machine and use a sufficiently large heap with Xms=Xmx and Xmn=0.1*Xmx. Second, exclude from these measurements the initial warmup runs for JIT compilation and any outliers where GC happened.

2) Breeze comparison: Please also get the Breeze numbers without native BLAS libraries as another baseline with a comparable runtime platform.

3) Bigger picture: Just to clarify the overall question here - of course native BLAS libraries are expected to be faster for square (or similar) dense matrix multiply, as current JDKs usually only compile scalar but no packed SIMD instructions for these operations. How much faster depends on the architecture. On older architectures with 128-bit and 256-bit vector units, it was not too problematic. But the trend toward wider vector units continues, and hence it is worth thinking about if nothing happens on the JDK front.

The reasons why we decided for platform independence in the past were as follows:

(a) Square dense matrix multiply is not a common operation (other than in DL). Much more common are memory-bandwidth-bound matrix-vector multiplications, and there copying your data out to a native library actually leads to a 3x slowdown.

(b) In end-to-end algorithms, especially in large-scale scenarios, we often see other factors dominating performance.

(c) Keeping the build and deployment simple, without the dependency on native libraries, was the logical conclusion given (a) and (b).

(d) There are also workarounds: a user can always define an external function (we did this in the past with certain LAPACK functions) and call whatever library she wants from there.

Regards, Matthias

On 12/1/2016 12:27 AM, fschue...@posteo.de wrote:

This is the printout from 50 iterations with the timings uncommented:

    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
    MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494

On 01.12.2016 00:08, Matthias Boehm wrote:

Could you please make sure you're comparing the right thing? Even on old Sandy Bridge CPUs, our matrix multiply for 1k x 1k usually takes 40-50 ms. We also did the same experiments with larger matrices, and SystemML was about 2x faster compared to Breeze. Please uncomment the timings in LibMatrixMult.matrixMult and double-check the timing, as well as that we're actually comparing dense matrix multiply.

Regards, Matthias

On 11/30/2016 11:54 PM, fschue...@posteo.de wrote:

Hi all,

I have run a very quick comparison between SystemML's LibMatrixMult and Breeze matrix multiplication using native BLAS (OpenBLAS through netlib-java). In this very small comparison I see a performance difference for dense-dense matrices of size 1000 x 1000 (our default block size), with Breeze being about 5-6 times faster.

The code I used can be found here: https://github.com/fschueler/incubator-systemml/blob/model_types/src/test/scala/org/apache/sysml/api/linalg/layout/local/SystemMLLocalBackendTest.scala

Running this code with 50 iterations each gives me, for example, average times of:

Breeze: 49.74 ms
SystemML: 363.44 ms

I don't want to say this is true for every operation, but those results let us form the hypothesis that native BLAS operations can lead to a significant speedup for certain operations, which is worth testing with more advanced benchmarks.

Btw: I am definitely not saying we should use Breeze here. I am more looking at native BLAS and LAPACK implementations in general (as provided by OpenBLAS, MKL, etc.).

Let me know what you think!

Felix
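The measurement advice given in this thread (discard the JIT warmup runs, then check whether a difference exceeds the standard deviation) could be sketched roughly as follows. This is a minimal, single-threaded illustration: the naive `multiply` loop is a stand-in and is neither SystemML's LibMatrixMult nor Breeze, and the class name `MatMultBench`, the helper names, and the values of `n`, `runs`, and `warmup` are all assumptions chosen for the example.

```java
import java.util.Arrays;

// Hypothetical micro-benchmark sketch; not the code used in this thread.
final class MatMultBench {

    /** Naive dense multiply C = A * B for row-major n x n matrices (i-k-j loop order). */
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double stddev(double[] xs) {
        if (xs.length < 2) return Double.NaN;
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    /** Drops the first `warmup` measurements, which include JIT compilation time. */
    static double[] dropWarmup(double[] timings, int warmup) {
        return Arrays.copyOfRange(timings, Math.min(warmup, timings.length), timings.length);
    }

    public static void main(String[] args) {
        int n = 256, runs = 30, warmup = 10; // illustrative values only
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);

        double[] ms = new double[runs];
        double checksum = 0;
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            double[] c = multiply(a, b, n);
            ms[r] = (System.nanoTime() - t0) / 1e6;
            checksum += c[0]; // use the result so the JIT cannot discard the work
        }

        double[] steady = dropWarmup(ms, warmup);
        System.out.printf("mean=%.3f ms, stddev=%.3f ms (checksum=%.1f)%n",
                mean(steady), stddev(steady), checksum);
    }
}
```

Reporting the mean together with the standard deviation of the post-warmup runs makes it easier to judge whether a gap such as 1 ms vs. 2 ms is real or just noise. A fixed heap (Xms=Xmx), as suggested earlier in the thread, should further reduce GC-related outliers.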