Hi all,

I'm trying to implement a clustering algorithm on Hadoop. Among other things, it involves a lot of matrix multiplications. LINA (http://wiki.apache.org/lucene-hadoop/Lina) would probably be a perfect fit here, but I can't afford to wait for it. Btw, I can't find HADOOP-1655 any more. What happened to it?
With the ordinary matrix product (each element of the result is the sum of the products of one row's entries with the matching entries of one column), the easiest way to formulate the computation is to send one row and one column to a mapper, which outputs one element of the resulting matrix. The reducer can then take this element and put it into the correct position in the output file.

I need your advice on how to design the input file(s) and how to make the input splits. I'd like to keep the matrices in separate files (they'll be used for more than one multiplication, and it's cleaner to keep them separate). I guess I'd then have to use MultiFileSplit and MultiFileInputFormat somehow. Is it possible at all to send two records (one row and one column, or two rows if the other matrix is stored column by column) from two input splits to a single mapper? Or should I look for an alternative way to multiply matrices?

--
regards,
Milan
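P.S. To make the formulation concrete, here is a minimal plain-Python sketch of the row-times-column idea. It does not touch Hadoop at all, and every name in it (map_row_col, reduce_place, multiply) is made up for illustration. It only simulates the data flow: each "mapper" call gets one row of the first matrix and one column of the second, and the "reducer" puts the resulting element into place.

```python
# Plain-Python simulation of the formulation in the post; no Hadoop API is used.
# All function names are hypothetical and exist only for this illustration.

def map_row_col(i, j, row, col):
    """Mapper: takes row i of A and column j of B, emits ((i, j), dot product)."""
    if len(row) != len(col):
        raise ValueError("row and column lengths differ")
    return (i, j), sum(a * b for a, b in zip(row, col))


def reduce_place(result, key, value):
    """Reducer: writes one computed element into its position in the result."""
    i, j = key
    result[i][j] = value


def multiply(a, b):
    """Run the map and reduce steps over every (row, column) pair."""
    n_rows, n_cols = len(a), len(b[0])
    columns = list(zip(*b))  # view B column by column, one record per column
    result = [[0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(a):
        for j, col in enumerate(columns):
            key, value = map_row_col(i, j, row, col)
            reduce_place(result, key, value)
    return result


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = multiply(a, b)
```

The nested loop in multiply is the part I don't know how to express with input splits: every mapper call needs a record from each of the two files at the same time.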