Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Shannon Quinn Sun, 02 Jan 2011 12:29:05 -0800

The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in 
favor of continuing to use the old API until we can switch to 21.0.

I am more than willing to admit it's an ugly hack; however, I started an email thread back in November during ApacheCon regarding testing Mahout with 0.21 of Hadoop, and the general consensus was to avoid 0.21 until the next version of Hadoop was released (seems even the Hadoop folks don't care for 0.21 too much). I'm more than happy to pick up those experiments and relay the results, if the sentiment has changed.

2 ) This implementation uses 3 M/R jobs where the original one has only 1. I 
agree that the first two jobs are very basic operations, but still for 
performance's sake it's better to keep the number of jobs low.  I'm almost 100% 
certain that this implementation will be slower than the original one ( though 
I have no idea how much slower; it would be interesting to know )

I completely agree, I just wasn't sure how to do the join operation (see my emails with Sean Owen) in the absence of CompositeInputFormat. It sounds from Sean, however, that this is still possible; I just need to do my research.

3 ) Every row of the DRM now has an extra String variable to store and send. 
Certainly when the matrix is very sparse this will result in a substantial 
overhead.

No arguments here. From the same email thread with Sean, though, it sounds as though VectorWritable might have what we need without having to resort to what is effectively a Writable wrapper (NamedVectorWritable), so I'll do that.

4 ) the MatrixMultiplicationReducer receives a NamedVectorWritable, but there's 
no reason for this. It would be better to use a plain VectorWritable.

I noticed this while I was doing the patch, but since the input and output of the Combiner have to be the same, I didn't see an alternative (unless there's a way around this?).

If we insist on compliance with 20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the 
current input path in the setup.

This is an awesome webpage! I'll read over this more carefully soon, but what you mentioned was my original strategy: to check the input path being read from within the Mappers/Reducers. Unfortunately, I couldn't find a way to do this, as the only "Path" I could check was the "currentWorkingDirectory", which just turned out to be MAHOUT_HOME (at least on my dev machine). If there's a way of doing this, please do let me know.
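
For reference, a minimal plain-Java sketch of the path-tagging idea discussed above, with no Hadoop dependency. It assumes the job computes A^T * B by joining the rows of A and B that share a row index and summing their outer products. Every name in it (PathTaggedJoinSketch, Row, tagFor, transposeTimes, the /data/A/ and /data/B/ paths) is hypothetical and not taken from the patch. In a real job the input path would have to come from the task's input split rather than the working directory; in the org.apache.hadoop.mapreduce API that is typically done by casting context.getInputSplit() to FileSplit and calling getPath(), though this has not been checked against 0.20.2 here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch (no Hadoop) of a path-tagged reduce-side join for A^T * B. */
class PathTaggedJoinSketch {

    /** One matrix row as a mapper might see it: source path, row index, values. */
    static final class Row {
        final String inputPath;
        final int index;
        final double[] values;

        Row(String inputPath, int index, double[] values) {
            this.inputPath = inputPath;
            this.index = index;
            this.values = values;
        }
    }

    /** "Map" step: decide which operand a row belongs to from its input path. */
    static char tagFor(String inputPath, String pathA) {
        return inputPath.startsWith(pathA) ? 'A' : 'B';
    }

    /** "Reduce" step: join rows by index, then sum the outer products. */
    static double[][] transposeTimes(List<Row> rows, String pathA, int colsA, int colsB) {
        // Stand-in for the shuffle: group by row index; slot 0 is A's row, slot 1 is B's.
        Map<Integer, double[][]> joined = new TreeMap<>();
        for (Row r : rows) {
            int slot = tagFor(r.inputPath, pathA) == 'A' ? 0 : 1;
            joined.computeIfAbsent(r.index, k -> new double[2][])[slot] = r.values;
        }
        double[][] product = new double[colsA][colsB];
        for (double[][] pair : joined.values()) {
            if (pair[0] == null || pair[1] == null) {
                continue; // row index present in only one operand contributes nothing
            }
            for (int i = 0; i < colsA; i++) {
                for (int j = 0; j < colsB; j++) {
                    product[i][j] += pair[0][i] * pair[1][j];
                }
            }
        }
        return product;
    }

    /** Small 2x2 example; the paths are made up for illustration. */
    static double[][] demo() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("/data/A/part-00000", 0, new double[] {1, 2}));
        rows.add(new Row("/data/B/part-00000", 0, new double[] {5, 6}));
        rows.add(new Row("/data/A/part-00000", 1, new double[] {3, 4}));
        rows.add(new Row("/data/B/part-00000", 1, new double[] {7, 8}));
        return transposeTimes(rows, "/data/A/", 2, 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.deepToString(demo())); // [[26.0, 30.0], [38.0, 44.0]]
    }
}
```

Tagging by path keeps both operands as plain vectors on disk, so no extra per-row name needs to be stored; the tag only has to exist between the map and reduce steps.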

Some more general remarks: I think the matrix multiplication can be implemented 
more efficiently. I've done a matrix multiplication of a sparse 500kx15k matrix 
with around 35 million elements on a quite powerful cluster of 10 nodes, and 
this took around 30 minutes. I have no idea of the performance of the 
implementation described at 
http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really 
compare. But IMHO this can be improved ( though it's possible that the poor 
performance was due to mistakes made by me )

I will definitely investigate these methods over the coming days; these look fantastic.


Shannon
