The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in 
favor of keeping the old API until we can switch to 0.21.
I am more than willing to admit it's an ugly hack; however, I started an email thread back in November during ApacheCon regarding testing Mahout with 0.21 of Hadoop, and the general consensus was to avoid 0.21 until the next version of Hadoop was released (seems even the Hadoop folks don't care for 0.21 too much). I'm more than happy to pick up those experiments and relay the results, if the sentiment has changed.
2 ) This implementation uses 3 M/R jobs where the original one has only 1. I 
agree that the first two jobs are very basic operations, but for 
performance's sake it's still better to keep the number of jobs low. I'm almost 
100% certain that this implementation will be slower than the original one 
(though I have no idea how much slower; it would be interesting to know).
I completely agree, I just wasn't sure how to do the join operation (see my emails with Sean Owen) in the absence of CompositeInputFormat. It sounds from Sean, however, that this is still possible; I just need to do my research.
3 ) Every row of the DRM now has an extra String variable to store and send. 
Certainly when the matrix is very sparse this will result in a substantial 
overhead.
No arguments here. From the same email thread with Sean, though, it sounds as though VectorWritable might have what we need without having to resort to what is effectively a Writable wrapper (NamedVectorWritable), so I'll do that.
4 ) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's 
no reason for this. It would be better to use a plain VectorWritable.
I noticed this while I was doing the patch, but since the input and output of the Combiner have to be the same, I didn't see an alternative (unless there's a way around this?).
If we insist on compliance with 0.20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the 
current input path in the setup.

This is an awesome webpage! I'll read over this more carefully soon, but what you mentioned was my original strategy: to check the input path being read from within the Mappers/Reducers. Unfortunately, I couldn't find a way to do this, as the only "Path" I could check was the "currentWorkingDirectory", which just turned out to be MAHOUT_HOME (at least on my dev machine). If there's a way of doing this, please do let me know.
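
A rough sketch of the path-checking idea, in plain Java with no Hadoop dependency, so it only simulates the logic. Each row is tagged according to the directory of the file it was read from, rows are grouped by their shared row index, and the product is accumulated as a sum of outer products. In a real job the path would presumably come from the input split (with the newer API, commonly ((FileSplit) context.getInputSplit()).getPath() inside setup()), which is worth verifying against 0.20.2. All names here (PathTaggedJoin, TaggedRow, isFromA, emit, multiply) are hypothetical, and the sketch assumes the product wanted is A^T B with the two inputs joined on row index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathTaggedJoin {

    /** A matrix row plus a flag recording which input it was read from. */
    static final class TaggedRow {
        final boolean fromA;
        final double[] values;

        TaggedRow(boolean fromA, double[] values) {
            this.fromA = fromA;
            this.values = values;
        }
    }

    /** Decides the tag by comparing the parent directory of the file being read with A's directory. */
    static boolean isFromA(String splitPath, String dirOfA) {
        int slash = splitPath.lastIndexOf('/');
        String parent = slash < 0 ? "" : splitPath.substring(0, slash);
        return parent.equals(dirOfA);
    }

    /** Stands in for the map step: tag the row and group it under its row index. */
    static void emit(Map<Integer, List<TaggedRow>> grouped, String splitPath,
                     String dirOfA, int rowIndex, double[] row) {
        grouped.computeIfAbsent(rowIndex, k -> new ArrayList<>())
               .add(new TaggedRow(isFromA(splitPath, dirOfA), row));
    }

    /** Stands in for the reduce step: sums the outer products of matching rows, giving A^T * B. */
    static double[][] multiply(Map<Integer, List<TaggedRow>> grouped, int colsA, int colsB) {
        double[][] c = new double[colsA][colsB];
        for (List<TaggedRow> rows : grouped.values()) {
            double[] a = null;
            double[] b = null;
            for (TaggedRow r : rows) {
                if (r.fromA) {
                    a = r.values;
                } else {
                    b = r.values;
                }
            }
            if (a == null || b == null) {
                continue; // a row present on only one side contributes nothing
            }
            for (int i = 0; i < colsA; i++) {
                if (a[i] == 0.0) {
                    continue; // skip zero entries, as a sparse implementation would
                }
                for (int j = 0; j < colsB; j++) {
                    c[i][j] += a[i] * b[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Map<Integer, List<TaggedRow>> grouped = new HashMap<>();
        // A = [[1, 2], [3, 4]] read from /tmp/A, B = [[5, 6], [7, 8]] read from /tmp/B
        emit(grouped, "/tmp/A/part-00000", "/tmp/A", 0, new double[] {1, 2});
        emit(grouped, "/tmp/A/part-00000", "/tmp/A", 1, new double[] {3, 4});
        emit(grouped, "/tmp/B/part-00000", "/tmp/A", 0, new double[] {5, 6});
        emit(grouped, "/tmp/B/part-00000", "/tmp/A", 1, new double[] {7, 8});
        System.out.println(Arrays.deepToString(multiply(grouped, 2, 2)));
        // prints [[26.0, 30.0], [38.0, 44.0]]
    }
}
```

Because the tag is a single boolean decided once per input file, this avoids carrying a String name with every row; whether that maps cleanly onto VectorWritable in the actual patch is left open here.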
Some more general remarks: I think the matrix multiplication can be implemented 
more efficiently. I've done a matrix multiplication of a sparse 500kx15k matrix 
with around 35 million elements on a quite powerful cluster of 10 nodes, and 
this took around 30 minutes. I have no idea of the performance of the 
implementation described at 
http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really 
compare. But IMHO this can be improved (though it's possible that the poor 
performance was due to mistakes made by me).
I will definitely investigate these methods over the coming days; they look fantastic.

Shannon
