[3/3] systemml git commit: [SYSTEMML-1792] Performance sparse-dense block matrix multiply

mboehm7 Thu, 20 Jul 2017 01:26:03 -0700

[SYSTEMML-1792] Performance sparse-dense block matrix multiply

Our sparse-dense matrix multiply was already cache-conscious but used
very small block static block sizes, which were optimized for moderate
sparsity. However, for cases with very sparse matrices (and skinny right
hand size matrices), the small block sizes added substantial overhead of
more than an order of magnitude. This patch makes these block sizes
adaptive, consistent with our cache-conscious implementations of
sparsity exploiting matrix multiply operators such as wsloss.


On a scenario with 1K x 2M mini batches of average sparsity 0.0003 and a
dense right hand side of 2M x 100, the runtime improved from ~300ms to
<2ms, without hurting the case of moderate sparsity.


Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/cca4f942
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/cca4f942
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/cca4f942

Branch: refs/heads/master
Commit: cca4f942cd8d4a2ac0b6e5994ebac4efdcdc06c3
Parents: 4a24b9a
Author: Matthias Boehm <[email protected]>
Authored: Thu Jul 20 01:21:19 2017 -0700
Committer: Matthias Boehm <[email protected]>
Committed: Thu Jul 20 01:24:27 2017 -0700

----------------------------------------------------------------------
 .../org/apache/sysml/runtime/matrix/data/LibMatrixMult.java | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/systemml/blob/cca4f942/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java
----------------------------------------------------------------------
diff --git 
a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java 
b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java
index 30e7d3d..c2af52d 100644
--- a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java
+++ b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixMult.java
@@ -1292,8 +1292,11 @@ public class LibMatrixMult
                        {                                                       
                                //blocksizes to fit blocks of B (dense) and 
several rows of A/C in common L2 cache size, 
                                //while blocking A/C for L1/L2 yet allowing 
long scans (2 pages) in the inner loop over j
-                               final int blocksizeI = 32;
-                               final int blocksizeK = 24; 
+                               //in case of almost ultra-sparse matrices, we 
cannot ensure the blocking for the rhs and
+                               //output - however, in this case it's unlikely 
that we consume every cache line in the rhs
+                               
+                               final int blocksizeI = (int) 
(8L*m*cd/m1.nonZeros);
+                               final int blocksizeK = (int) 
(8L*m*cd/m1.nonZeros);
                                final int blocksizeJ = 1024; 
                                
                                //temporary array of current sparse positions
@@ -1314,7 +1317,7 @@ public class LibMatrixMult
                                                                        int[] 
aix = a.indexes(i);
                                                                        
double[] avals = a.values(i);                                   
                                                                        
-                                                                       int k = 
curk[i-bi] + apos;                                                              
        
+                                                                       int k = 
curk[i-bi] + apos;                      
                                                                //rest not 
aligned to blocks of 4 rows
                                                                        int bn 
= alen%4;
                                                                        for( ; 
k<apos+bn && aix[k]<bkmin; k++ )

[3/3] systemml git commit: [SYSTEMML-1792] Performance sparse-dense block matrix multiply

Reply via email to