[ 
https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976204#action_12976204
 ] 

Joris Geessels commented on MAHOUT-537:
---------------------------------------

The matrix-matrix multiplication seems like an ugly hack to me. I'm actually in 
favor of keeping the old API until we can switch to 0.21.0. 
Some remarks: 
1) I didn't test the code either, but couldn't spot any obvious errors. So it 
seems to me that it should work.
2) This implementation uses 3 M/R jobs where the original one has only 1. I 
agree that the first two jobs are very basic operations, but for 
performance's sake it's still better to keep the number of jobs low. I'm almost 
100% certain that this implementation will be slower than the original one 
(though I have no idea how much slower; it would be interesting to know).
3) Every row of the DRM now has an extra String variable to store and send. 
Especially when the matrix is very sparse, this will result in substantial 
overhead. 
4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's 
no reason for this. It would be better to use a plain VectorWritable.

If we insist on compliance with 0.20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html 
This implementation avoids the use of CompositeInputFormat by checking the 
current input path in the setup. 
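
To make the idea concrete, here is a rough, plain-Java sketch of tagging rows by 
their input path and joining them on the reduce side. It is not Norstad's 
algorithm and not Mahout code. The class PathTaggedJoin, the directories 
/matrices/A and /matrices/B, and all method names are hypothetical, and I am 
assuming the join pairs rows with the same index and sums their outer products 
(which yields A'B). In a real 0.20.2 mapper the path would presumably come from 
the input split, e.g. ((FileSplit) context.getInputSplit()).getPath().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PathTaggedJoin {
    // Hypothetical input directories; a real job would read these from its Configuration.
    static final String LEFT_DIR = "/matrices/A";
    static final String RIGHT_DIR = "/matrices/B";

    /** A sparse row (column index -> value) tagged with the matrix it came from. */
    static final class TaggedRow {
        final boolean fromLeft;
        final Map<Integer, Double> row;

        TaggedRow(boolean fromLeft, Map<Integer, Double> row) {
            this.fromLeft = fromLeft;
            this.row = row;
        }
    }

    /** Stand-in for setup(): decide from the split's path which matrix is being read. */
    static boolean isLeftInput(String splitPath) {
        if (splitPath.startsWith(LEFT_DIR + "/")) return true;
        if (splitPath.startsWith(RIGHT_DIR + "/")) return false;
        throw new IllegalArgumentException("Unexpected input path: " + splitPath);
    }

    /** Stand-in for map(): key by row index and tag the value with its origin. */
    static void map(String splitPath, int rowIndex, Map<Integer, Double> row,
                    Map<Integer, List<TaggedRow>> shuffle) {
        // Evaluated per record here for brevity; a real mapper would cache it in setup().
        boolean fromLeft = isLeftInput(splitPath);
        shuffle.computeIfAbsent(rowIndex, k -> new ArrayList<>())
               .add(new TaggedRow(fromLeft, row));
    }

    /** Stand-in for reduce(): pair the rows sharing an index, accumulate their outer product. */
    static void reduce(List<TaggedRow> rows, Map<Integer, Map<Integer, Double>> product) {
        Map<Integer, Double> a = null;
        Map<Integer, Double> b = null;
        for (TaggedRow t : rows) {
            if (t.fromLeft) a = t.row; else b = t.row;
        }
        if (a == null || b == null) return; // index present on one side only
        for (Map.Entry<Integer, Double> ea : a.entrySet()) {
            for (Map.Entry<Integer, Double> eb : b.entrySet()) {
                product.computeIfAbsent(ea.getKey(), k -> new TreeMap<>())
                       .merge(eb.getKey(), ea.getValue() * eb.getValue(), Double::sum);
            }
        }
    }

    /** Converts a dense row to a sparse one by dropping zeros. */
    static Map<Integer, Double> sparseRow(double... dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) sparse.put(i, dense[i]);
        }
        return sparse;
    }

    /** Runs the simulated job on A = [[1,2],[0,3]] and B = [[4,0],[5,6]]. */
    static Map<Integer, Map<Integer, Double>> demo() {
        Map<Integer, List<TaggedRow>> shuffle = new TreeMap<>();
        map(LEFT_DIR + "/part-00000", 0, sparseRow(1.0, 2.0), shuffle);
        map(LEFT_DIR + "/part-00000", 1, sparseRow(0.0, 3.0), shuffle);
        map(RIGHT_DIR + "/part-00000", 0, sparseRow(4.0, 0.0), shuffle);
        map(RIGHT_DIR + "/part-00000", 1, sparseRow(5.0, 6.0), shuffle);

        Map<Integer, Map<Integer, Double>> product = new TreeMap<>();
        for (List<TaggedRow> rows : shuffle.values()) {
            reduce(rows, product);
        }
        return product;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is only that a boolean tag derived from the path is 
enough to tell the two inputs apart in a single job, so neither 
CompositeInputFormat nor a per-row String name is needed for that purpose.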

Some more general remarks: I think the matrix multiplication can be implemented 
more efficiently. I've done a matrix multiplication of a sparse 500k x 15k 
matrix with around 35 million elements on a fairly powerful cluster of 10 nodes, 
and this took around 30 minutes. I have no idea how the implementation 
described at 
http://homepage.mac.com/j.norstad/matrix-multiply/index.html performs, so I 
can't really compare. But IMHO this can be improved (though it's possible that 
the poor performance was due to mistakes on my part).

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, 
> in particular eliminate dependence on the deprecated JobConf, using instead 
> the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
