The matrix-matrix multiplication seems like an ugly hack to me; I'm actually in
favor of keeping the old API until we can switch to 0.21.
I am more than willing to admit it's an ugly hack; however, I started an
email thread back in November during ApacheCon regarding testing Mahout
with 0.21 of Hadoop, and the general consensus was to avoid 0.21 until
the next version of Hadoop was released (seems even the Hadoop folks
don't care for 0.21 too much). I'm more than happy to pick up those
experiments and relay the results, if the sentiment has changed.
2 ) This implementation uses 3 M/R jobs where the original one has only 1. I
agree that the first two jobs are very basic operations, but still, for
performance's sake it's better to keep the number of jobs low. I'm almost 100%
certain that this implementation will be slower than the original one (though
I have no idea how much slower; it would be interesting to know).
I completely agree, I just wasn't sure how to do the join operation (see
my emails with Sean Owen) in the absence of CompositeInputFormat. It
sounds from Sean, however, that this is still possible; I just need to
do my research.
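
For reference, a reduce-side join is one common way to pair up rows without
CompositeInputFormat. The following is only a minimal in-memory sketch in plain
Java (no Hadoop classes), under the assumption that the product is accumulated
as a sum of outer products of rows sharing the same index, which yields A^T * B.
The class and method names (ReduceSideJoinSketch, TaggedRow, shuffle,
multiplyTransposed) are hypothetical and not part of Mahout; the one-bit tag
stands in for whatever marker identifies the source matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of a reduce-side join; hypothetical names, no Hadoop. */
final class ReduceSideJoinSketch {

    /** A row tagged with its source (true = left matrix A, false = right matrix B). */
    static final class TaggedRow {
        final boolean fromA;
        final double[] values;

        TaggedRow(boolean fromA, double[] values) {
            this.fromA = fromA;
            this.values = values;
        }
    }

    /** "Map" and "shuffle": tag every row, then group the rows by row index. */
    static Map<Integer, List<TaggedRow>> shuffle(double[][] a, double[][] b) {
        Map<Integer, List<TaggedRow>> groups = new TreeMap<>();
        for (int i = 0; i < a.length; i++) {
            groups.computeIfAbsent(i, k -> new ArrayList<>()).add(new TaggedRow(true, a[i]));
        }
        for (int i = 0; i < b.length; i++) {
            groups.computeIfAbsent(i, k -> new ArrayList<>()).add(new TaggedRow(false, b[i]));
        }
        return groups;
    }

    /** "Reduce": per row index, add the outer product of A's row and B's row. */
    static double[][] multiplyTransposed(double[][] a, double[][] b) {
        // Assumes both matrices are non-empty and rectangular.
        int rows = a[0].length;
        int cols = b[0].length;
        double[][] product = new double[rows][cols];
        for (List<TaggedRow> group : shuffle(a, b).values()) {
            double[] rowA = null;
            double[] rowB = null;
            for (TaggedRow row : group) {
                if (row.fromA) {
                    rowA = row.values;
                } else {
                    rowB = row.values;
                }
            }
            if (rowA == null || rowB == null) {
                continue; // an index present in only one matrix contributes nothing
            }
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    product[i][j] += rowA[i] * rowB[j];
                }
            }
        }
        return product;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiplyTransposed(a, b)));
    }
}
```

In a real job the grouping would be done by the framework's shuffle and the
summation could be spread over a combiner and a reducer; the sketch only shows
where the tag is needed and where it can be dropped again.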
3 ) Every row of the DRM now has an extra String variable to store and send.
Certainly when the matrix is very sparse this will result in a substantial
overhead.
No arguments here. From the same email thread with Sean, though, it
sounds as though VectorWritable might have what we need without having
to resort to what is effectively a Writable wrapper
(NamedVectorWritable), so I'll do that.
4 ) The MatrixMultiplicationReducer receives a NamedVectorWritable, but there's
no reason for this. It would be better to use a plain VectorWritable.
I noticed this while I was doing the patch, but since the input and
output of the Combiner have to be the same, I didn't see an alternative
(unless there's a way around this?).
If we insist on compliance with 0.20.2, it might be interesting to have a look at:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
This implementation avoids the use of CompositeInputFormat by checking the
current input path in the setup.
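
A rough sketch of that idea, in plain Java so it runs without Hadoop: decide
which matrix a record belongs to from the path of the file being read. The
class and method names (InputPathTagSketch, sourceOf, normalize) are
hypothetical. The assumption, noted in the comment, is that with the new
(org.apache.hadoop.mapreduce) API and a file-based input format the path can be
read from the input split in setup(); this has not been tested against Mahout.

```java
/** Hypothetical helper: pick the source matrix from the input file path. */
final class InputPathTagSketch {

    enum Source { LEFT, RIGHT }

    /**
     * Assumption: inside a new-API Mapper the path string could be obtained in
     * setup() along the lines of
     *   String path = ((FileSplit) context.getInputSplit()).getPath().toString();
     * which only works when the splits are FileSplit instances. All three
     * arguments must be in the same form (for example, all fully qualified).
     */
    static Source sourceOf(String splitPath, String leftDir, String rightDir) {
        String path = normalize(splitPath);
        // Append "/" so that ".../A" does not also match ".../AB/part-00000".
        if (path.startsWith(normalize(leftDir) + "/")) {
            return Source.LEFT;
        }
        if (path.startsWith(normalize(rightDir) + "/")) {
            return Source.RIGHT;
        }
        throw new IllegalArgumentException(
                "Split " + splitPath + " is under neither input directory");
    }

    /** Drop trailing slashes so that "dir" and "dir/" compare equal. */
    static String normalize(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(sourceOf("hdfs://nn/data/A/part-00000",
                "hdfs://nn/data/A", "hdfs://nn/data/B"));
    }
}
```

The mapper would then tag each emitted row with the result, so the reducer can
tell the two operands apart.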
This is an awesome webpage! I'll read over this more carefully soon, but
what you mentioned was my original strategy: to check the input path
being read from within the Mappers/Reducers. Unfortunately, I couldn't
find a way to do this, as the only "Path" I could check was the
"currentWorkingDirectory", which just turned out to be MAHOUT_HOME (at
least on my dev machine). If there's a way of doing this, please do let
me know.
Some more general remarks: I think the matrix multiplication can be implemented
more efficiently. I've done a matrix multiplication of a sparse 500kx15k matrix
with around 35 million elements on a quite powerful cluster of 10 nodes, and
this took around 30 minutes. I have no idea of the performance of the
implementation described at
http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't really
compare. But IMHO this can be improved (though it's possible that the poor
performance was due to mistakes on my part).
I will definitely investigate these methods over the coming days; they
look fantastic.
Shannon