[
https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976204#action_12976204
]
Joris Geessels commented on MAHOUT-537:
---------------------------------------
The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in
favor of keeping the old API until we can switch to Hadoop 0.21.0.
Some remarks:
1) I didn't test the code either, but couldn't spot any obvious errors. So it
seems to me that it should work.
2) This implementation uses 3 M/R jobs where the original one has only 1. I
agree that the first two jobs are very basic operations, but for
performance's sake it's still better to keep the number of jobs low. I'm almost
100% certain that this implementation will be slower than the original one
(though I have no idea how much slower; that would be interesting to know).
3) Every row of the DRM now has an extra String variable to store and send.
Especially when the matrix is very sparse, this will result in substantial
overhead.
4) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's
no reason for this. It would be better to use a plain VectorWritable.
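To put remark 3 in perspective, here is a rough back-of-the-envelope sketch of
how much a per-row name could add to a sparse row. The byte counts are
assumptions made for illustration (12 bytes per non-zero entry, a 2-byte length
prefix plus one byte per ASCII character for the name) and are not Mahout's
actual wire format; RowNameOverhead is a hypothetical helper, not part of the
patch.

```java
import java.nio.charset.StandardCharsets;

// Rough size model only; the constants below are assumptions, not Mahout's format.
final class RowNameOverhead {
    // Assumed cost of one non-zero entry: 4-byte int index + 8-byte double value.
    static final int BYTES_PER_ENTRY = 12;

    // Assumed cost of the name: 2-byte length prefix + the UTF-8 bytes of the name.
    static int nameBytes(String name) {
        return 2 + name.getBytes(StandardCharsets.UTF_8).length;
    }

    // Fraction of the serialized row that is spent on the name alone.
    static double overheadFraction(int nonZeros, String name) {
        int payload = nonZeros * BYTES_PER_ENTRY;
        int extra = nameBytes(name);
        return (double) extra / (payload + extra);
    }
}
```

Under this model a row with a single non-zero entry and the one-character name
"A" spends 3 of its 15 bytes (20%) on the name, while a row with 1000 non-zero
entries spends well under 0.1%. The sparser the rows, the more the name
dominates.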
If we insist on compliance with 0.20.2, it might be worth having a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the
current input path in the mapper's setup().
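The idea of tagging rows by their input path can be sketched in plain Java,
without any Hadoop classes. This is not the code from the page linked above and
not the patch under review; it only illustrates the tagging step followed by a
reduce-side join. All names here (PathTaggedJoin, TaggingMapper, the /tmp
paths) are hypothetical, and the sketch assumes the job computes A'B by pairing
rows of A and B that share a row index and summing their outer products. In a
real mapper the path would presumably be read from the input split rather than
passed in as a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PathTaggedJoin {
    /** A row tagged with the matrix it came from. */
    static final class TaggedRow {
        final boolean fromLeft;
        final double[] values;
        TaggedRow(boolean fromLeft, double[] values) {
            this.fromLeft = fromLeft;
            this.values = values;
        }
    }

    /** Stand-in for a mapper: setup() inspects the path once, map() tags each row. */
    static final class TaggingMapper {
        private boolean fromLeft;
        void setup(String inputPath, String leftMatrixPath) {
            // Trailing slash avoids confusing /tmp/A with /tmp/AB.
            fromLeft = inputPath.startsWith(leftMatrixPath + "/");
        }
        TaggedRow map(double[] row) {
            return new TaggedRow(fromLeft, row);
        }
    }

    /** Stand-in for the reducer: join rows by index and sum their outer products. */
    static double[][] multiplyTransposed(Map<Integer, List<TaggedRow>> grouped,
                                         int leftCols, int rightCols) {
        double[][] out = new double[leftCols][rightCols];
        for (List<TaggedRow> rows : grouped.values()) {
            double[] a = null;
            double[] b = null;
            for (TaggedRow r : rows) {
                if (r.fromLeft) a = r.values; else b = r.values;
            }
            if (a == null || b == null) continue; // inner join: need both sides
            for (int i = 0; i < leftCols; i++) {
                for (int j = 0; j < rightCols; j++) {
                    out[i][j] += a[i] * b[j];
                }
            }
        }
        return out;
    }

    /** Simulates map, shuffle and reduce for two small dense matrices. */
    static double[][] run(double[][] left, double[][] right) {
        String leftPath = "/tmp/A"; // hypothetical location of the left matrix
        String[] splitPaths = {"/tmp/A/part-00000", "/tmp/B/part-00000"};
        double[][][] inputs = {left, right};
        Map<Integer, List<TaggedRow>> grouped = new TreeMap<>();
        for (int s = 0; s < inputs.length; s++) {
            TaggingMapper mapper = new TaggingMapper();
            mapper.setup(splitPaths[s], leftPath);
            for (int row = 0; row < inputs[s].length; row++) {
                grouped.computeIfAbsent(row, k -> new ArrayList<>())
                       .add(mapper.map(inputs[s][row]));
            }
        }
        return multiplyTransposed(grouped, left[0].length, right[0].length);
    }
}
```

Calling PathTaggedJoin.run with A = {{1, 2}, {3, 4}} and B = {{5, 6}, {7, 8}}
yields A'B = {{26, 30}, {38, 44}}. The tag here is a single boolean per record,
which is cheaper than carrying a String name on every row.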
Some more general remarks: I think the matrix multiplication can be implemented
more efficiently. I've done a matrix multiplication of a sparse 500k x 15k
matrix with around 35 million non-zero elements on a quite powerful cluster of
10 nodes, and this took around 30 minutes. I have no idea of the performance of
the implementation described at
http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really
compare. But IMHO this can be improved (though it's possible that the poor
performance was due to mistakes on my part).
> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
> Key: MAHOUT-537
> URL: https://issues.apache.org/jira/browse/MAHOUT-537
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.4
> Reporter: Shannon Quinn
> Assignee: Shannon Quinn
> Attachments: MAHOUT-537.patch, MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API,
> in particular eliminate dependence on the deprecated JobConf, using instead
> the separate Job and Configuration objects.