On Sat, Jul 17, 2010 at 2:07 PM, Nicolas Maillot <[email protected]> wrote:
> Hi all, > > I would like to contribute to Mahout by starting with a simple task. > > I had the idea of Implementing the Mahalanobis distance : > http://en.wikipedia.org/wiki/Mahalanobis_distance > This should be quite easy and could be a useful feature. > > After looking at the Mahout code for a couple of hours, I have three > questions: > > 1) Generally speaking, what is the level of reliability of algorithms > implemented in matrix.linalg ? > They should be reasonably good, but they mostly lack unit tests. Further, they have generally not been converted to use Vector and Matrix which is something that is probably necessary over time. I have done a little bit in this line, but we need quite a bit more. As you find bits that you want to use, you should convert them over. THis will be a bit like pulling a thread on a sweater in that it will cause lots of additional bits to need conversion. I will help with this. As part of the conversion, we should add unit tests as well. > 2) As inverting a covariance matrix is required, is using the method > inverse(DoubleMatrix2D A) from class Algebra a good way to achieve this ? > Generally it is considered very bad practice to invert a matrix. Instead, you should use some sort of easily solved matrix decomposition. Commonly, QR decomposition is used, but SVD might be useful in certain situations. As I remember Mahalonobis, it uses moments to estimate principal components in the form of the inverse covariance matrix. Mahalonobis distances are then computed in this reference frame. This should be amenable to QR decomposition. > 3) Does is look reasonable if the computation of the distance is based on > dense matrices ( inside the method distance(Vector v1,Vector v2) method > part > of the future MahalanobisDistanceMeasure class) ? > For use in recommendations, it should be assumed that the inputs are sparse, but internal products might well be dense if they are fixed in size and relatively small. The computation of distance should accept sparse vectors, but projecting them down to small dense vectors to do the distance computation is just fine. > 4) What is the best way to create a DoubleMatrix1D from a Vector ? > Don't. Just convert the code using the DoubleMatrix1D as mentioned above.
