[SYSTEMML-1792] Performance sparse-dense block matrix multiply Our sparse-dense matrix multiply was already cache-conscious but used very small block static block sizes, which were optimized for moderate sparsity. However, for cases with very sparse matrices (and skinny right hand size matrices), the small block sizes added substantial overhead of more than an order of magnitude. This patch makes these block sizes adaptive, consistent with our cache-conscious implementations of sparsity exploiting matrix multiply operators such as wsloss.
On a scenario with 1K x 2M mini batches of average sparsity 0.0003 and a dense right hand side of 2M x 100, the runtime improved from ~300ms to <2ms, without hurting the case of moderate sparsity. Project: http://git-wip-us.apache.org/repos/asf/systemml/repo Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/cca4f942 Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/cca4f942 Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/cca4f942 Branch: refs/heads/master Commit: cca4f942cd8d4a2ac0b6e5994ebac4efdcdc06c3 Parents: 4a24b9a Author: Matthias Boehm <[email protected]> Authored: Thu Jul 20 01:21:19 2017 -0700 Committer: Matthias Boehm <[email protected]> Committed: Thu Jul 20 01:24:27 2017 -0700 ---------------------------------------------------------------------- .../org/apache/sysml/runtime/matrix/data/LibMatrixMult.java | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/systemml/blob/cca4f942/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java ---------------------------------------------------------------------- diff --git a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java index 30e7d3d..c2af52d 100644 --- a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java +++ b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java @@ -1292,8 +1292,11 @@ public class LibMatrixMult { //blocksizes to fit blocks of B (dense) and several rows of A/C in common L2 cache size, //while blocking A/C for L1/L2 yet allowing long scans (2 pages) in the inner loop over j - final int blocksizeI = 32; - final int blocksizeK = 24; + //in case of almost ultra-sparse matrices, we cannot ensure the blocking for the rhs and + //output - however, in this case it's unlikely that we consume every cache line in the rhs + + final int blocksizeI = (int) (8L*m*cd/m1.nonZeros); + final int blocksizeK = (int) (8L*m*cd/m1.nonZeros); final int blocksizeJ = 1024; //temporary array of current sparse positions @@ -1314,7 +1317,7 @@ public class LibMatrixMult int[] aix = a.indexes(i); double[] avals = a.values(i); - int k = curk[i-bi] + apos; + int k = curk[i-bi] + apos; //rest not aligned to blocks of 4 rows int bn = alen%4; for( ; k<apos+bn && aix[k]<bkmin; k++ )
