Hi Ted, Thanks for the piece of advice. I have put a first implementation of the Mahalanobis distance in http://www.nicolas-maillot.net/MahoutContribs/ Those files should be put in three different places to be tested:
-in core/src/test/java/org/apache/mahout/common/distance : http://www.nicolas-maillot.net/MahoutContribs/TestMahalanobisDistanceMeasure.java - in core/src/main/java/org/apache/mahout/common/distance : http://www.nicolas-maillot.net/MahoutContribs/MahalanobisDistanceMeasure.java -in math/src/main/java/org/apache/mahout/math : http://www.nicolas-maillot.net/MahoutContribs/SingularValueDecomposition.java http://www.nicolas-maillot.net/MahoutContribs/Algebra.java All compiles and the unit tests are locally fine with the latest version of Mahout. I only added new files, so it limit risks. As you can see, I have migrated SingularValueDecomposition (useful to tackle the inverse covariance problem) to use the Matrix Class. I just Started with Algebra too (just what I needed for the Mahalanobis stuff), I will continue asap. One question, is putting those new implementations in src/main/java/org/apache/mahout/math a good way to go ? Migrating the stuff in new files is easier, once we are done, we will be able to remove the old implementation. We just need to find the best place for that. Any comment or feedback is welcome. I will then submit an official path with a few more comments. Many thanks, Cheers, Nicolas On Sat, Jul 17, 2010 at 11:40 PM, Ted Dunning <[email protected]> wrote: > On Sat, Jul 17, 2010 at 2:07 PM, Nicolas Maillot <[email protected]> wrote: > >> Hi all, >> >> I would like to contribute to Mahout by starting with a simple task. >> >> I had the idea of Implementing the Mahalanobis distance : >> http://en.wikipedia.org/wiki/Mahalanobis_distance >> This should be quite easy and could be a useful feature. >> >> After looking at the Mahout code for a couple of hours, I have three >> questions: >> >> 1) Generally speaking, what is the level of reliability of algorithms >> implemented in matrix.linalg ? >> > > They should be reasonably good, but they mostly lack unit tests. Further, > they have generally not been converted to use Vector and Matrix which is > something that is probably necessary over time. I have done a little bit in > this line, but we need quite a bit more. > > As you find bits that you want to use, you should convert them over. THis > will be a bit like pulling a thread on a sweater in that it will cause lots > of additional bits to need conversion. I will help with this. As part of > the conversion, we should add unit tests as well. > > >> 2) As inverting a covariance matrix is required, is using the method >> inverse(DoubleMatrix2D A) from class Algebra a good way to achieve this ? >> > > Generally it is considered very bad practice to invert a matrix. Instead, > you should use some sort of easily solved matrix decomposition. Commonly, > QR decomposition is used, but SVD might be useful in certain situations. > > As I remember Mahalonobis, it uses moments to estimate principal components > in the form of the inverse covariance matrix. Mahalonobis distances are > then computed in this reference frame. This should be amenable to QR > decomposition. > > >> 3) Does is look reasonable if the computation of the distance is based on >> dense matrices ( inside the method distance(Vector v1,Vector v2) method >> part >> of the future MahalanobisDistanceMeasure class) ? >> > > For use in recommendations, it should be assumed that the inputs are sparse, > but internal products might well be dense if they are fixed in size and > relatively small. The computation of distance should accept sparse vectors, > but projecting them down to small dense vectors to do the distance > computation is just fine. > > >> 4) What is the best way to create a DoubleMatrix1D from a Vector ? >> > > Don't. > > Just convert the code using the DoubleMatrix1D as mentioned above. >
