Matthias Boehm created SYSTEMML-1004:
----------------------------------------

             Summary: New spark tsmm2 matrix multiplication operator
                 Key: SYSTEMML-1004
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1004
             Project: SystemML
          Issue Type: Task
            Reporter: Matthias Boehm


The performance experiments for our 0.11 release revealed performance issues 
for LinregDS and PCA (specifically for {{t(X)%*%X}}) whenever the number of 
columns is larger than the blocksize. For example, the following scenario shows 
LinregDS results for an input size of 10M x 1K with a blocksize of 1K. For 
scenarios with an intercept (ict>0 in the log below), we append a column of 
ones, which pushes the number of columns beyond the blocksize, and hence we 
compile a {{cpmm}} instead of a {{tsmm}} instruction.
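
To make the arithmetic concrete, here is a minimal sketch of the selection described above. The function names are hypothetical and the rule is deliberately simplified to the single condition mentioned in this issue (number of columns vs. blocksize); the actual compiler may consider further operators and memory constraints.

```python
import math

def num_col_blocks(ncols, blocksize):
    # Number of column blocks needed to hold ncols columns.
    return math.ceil(ncols / blocksize)

def choose_operator(ncols, blocksize):
    # Hypothetical, simplified rule taken from the description in this issue:
    # tsmm only applies if every row fits into a single column block.
    return "tsmm" if ncols <= blocksize else "cpmm"

# 1K columns with a 1K blocksize: one column block.
no_intercept = choose_operator(1000, 1000)
# Appending a column of ones gives 1001 columns: two column blocks.
with_intercept = choose_operator(1000 + 1, 1000)
```

With these inputs, `no_intercept` is `"tsmm"` and `with_intercept` is `"cpmm"`, since a single extra column already spills into a second column block.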

{code}
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 122
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 350
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 297
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 279
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 360
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 286
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 299
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 82
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 292
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 292
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 82
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 290
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 301
{code}

We should introduce a new {{tsmm2}} operation for the scenario where the excess 
columns (those beyond the blocksize) fit into the broadcast memory budget, which 
would allow us to compute {{t(X)%*%X}} without shuffling {{t(X)}} and {{X}}.
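
The idea can be illustrated with a small single-process sketch in plain Python. This is not the proposed Spark implementation: the dictionary below merely stands in for a broadcast variable, the loop stands in for map-side tasks, and all names (`tsmm2_sketch`, `broadcast_x2`) are assumptions made for illustration. Each row block of the first column block is combined with the matching rows of the broadcast excess columns, the local {{t(B)%*%B}} is computed, and the partial results are summed.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tsmm2_sketch(X, blocksize):
    """Compute t(X) %*% X block-wise, assuming ncols > blocksize is allowed."""
    nrows, ncols = len(X), len(X[0])
    # Column split: X1 is the first column block, X2 holds the excess columns.
    X1 = [row[:blocksize] for row in X]
    X2 = [row[blocksize:] for row in X]
    # Stand-in for a broadcast variable: excess columns keyed by row-block index.
    broadcast_x2 = {start // blocksize: X2[start:start + blocksize]
                    for start in range(0, nrows, blocksize)}
    result = [[0] * ncols for _ in range(ncols)]
    # Each iteration mimics one task that only holds a row block of X1 locally.
    for start in range(0, nrows, blocksize):
        x1_blk = X1[start:start + blocksize]
        x2_blk = broadcast_x2[start // blocksize]
        # Re-assemble the full rows locally; no exchange of X between tasks.
        full = [r1 + r2 for r1, r2 in zip(x1_blk, x2_blk)]
        # Sum the per-block partial products into the final result.
        result = add(result, matmul(transpose(full), full))
    return result

X = [[1, 2, 3],
     [4, 5, 6]]
out = tsmm2_sketch(X, blocksize=2)  # 3 columns exceed the blocksize of 2
```

For this input, `out` equals `matmul(transpose(X), X)`, i.e. `[[17, 22, 27], [22, 29, 36], [27, 36, 45]]`. The only cross-task step left is the final sum of the small per-block results.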



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
