> Agreed, can you put your changes up as a patch on MAHOUT-121? That way we
> can do file diffs, etc.

I'm willing to do so. But then again, I wasn't sure whether what I changed was correct or whether I had simply misread the distance calculation method. After reading what Shashinkant suggested, I'll try to make a patch.

> --input ../nfantone/user.data --clusters ../nfantone/output/clusters --k 10
> --output ../content/nfantone/output/ --convergence 0.01 --overwrite

I see you chose k=10, as opposed to 200. That could explain our differences.

> We took the idea of optimized distance calculation from LingPipe. I
> suggest you to read this as I won't be able to communicate the idea as
> crisply. After reading this post, you will be able to relate the
> code.
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

I'll read this. I have a feeling I misunderstood the distance process. I'll comment soon.

> Yes, some of the method names are misnomers (or downright misleading)
> as there is no good term to describe the intermediate values.

Surely we can think of another name for that particular method. A name may be imprecise, but it shouldn't refer to some OTHER thing the method doesn't do.

>> SquaredEuclideanDistanceMeasure.java, currently, does the following:
>>
>> if (centroid.size() != v.size()) {
>>   throw new CardinalityException();
>> }
>>
>> double result = centroidLengthSquare;
>> result += v.getDistanceSquared(centroid);
>> return centroidLengthSquare + v.getDistanceSquared(centroid);
>
> Here, we don't call v.getDistanceSquared(centroid) again (which is
> redundant). It simply returns result calculated in the previous step.

You are completely right. My mistake. While refactoring, it seems I modified those lines and then left my modified version in when writing my post. That redundant call is not in the original code. However, I fail to see the need for declaring a double, adding to it with +=, etc. This should simply be:

    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }
    return centroidLengthSquare + v.getDistanceSquared(centroid);

> With the optimization of LingPipe, you will incur the calculations
> equal to the non-zero features in only one vector. You don't need to
> iterate on centroid vector for every distance calculation.
>
> Let me know if I am off the mark here.

You aren't. Again, I think I got the algebra wrong. I'll let you know.
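
P.S. For reference, here is a minimal sketch of how I currently read the algebra, based on the standard expansion ||v - c||^2 = ||c||^2 + sum_i (v_i^2 - 2 * v_i * c_i), where the sum only needs the non-zero entries of v. It does not use any Mahout classes: SparseDistanceSketch, lengthSquared, distanceSquared, naiveDistanceSquared and the Map-based sparse vector are all hypothetical names made up for illustration, and I am assuming a dense centroid and a sparse point. I may still have this wrong.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical names, not Mahout's API.
public class SparseDistanceSketch {

    // Squared length of a dense centroid. Meant to be computed once per
    // centroid and reused for every point compared against it.
    static double lengthSquared(double[] centroid) {
        double sum = 0.0;
        for (double c : centroid) {
            sum += c * c;
        }
        return sum;
    }

    // Optimized form: start from the cached ||c||^2 and add
    // (v_i^2 - 2 * v_i * c_i) for the non-zero entries of v only.
    static double distanceSquared(Map<Integer, Double> v, double[] centroid,
                                  double centroidLengthSquare) {
        double result = centroidLengthSquare;
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            double vi = e.getValue();
            double ci = centroid[e.getKey()];
            result += vi * vi - 2.0 * vi * ci;
        }
        return result;
    }

    // Naive form for comparison: visits every dimension of the centroid.
    static double naiveDistanceSquared(Map<Integer, Double> v, double[] centroid) {
        double sum = 0.0;
        for (int i = 0; i < centroid.length; i++) {
            double diff = v.getOrDefault(i, 0.0) - centroid[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] centroid = {1.0, 2.0, 0.0, 3.0};
        Map<Integer, Double> v = new HashMap<>();
        v.put(1, 2.0); // sparse point: only indices 1 and 3 are non-zero
        v.put(3, 1.0);

        double cached = lengthSquared(centroid);
        System.out.println("optimized: " + distanceSquared(v, centroid, cached));
        System.out.println("naive:     " + naiveDistanceSquared(v, centroid));
    }
}
```

If this reading is right, the two values should agree, and the per-point cost depends on the number of non-zero features in v rather than on the size of the centroid.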
