I am not sure if this matters in this context, but this formula loses precision for points that are very close together, because it subtracts nearly equal quantities. That can affect ordering in the limit.
By "lose precision", I mean the result can degrade to 7-8 significant figures instead of 16 or so. I doubt this matters, but I wouldn't know if it does.

---------- Forwarded message ----------
From: <[email protected]>
Date: Fri, Oct 26, 2012 at 11:49 AM
Subject: svn commit: r1402553 - /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java
To: [email protected]

Author: srowen
Date: Fri Oct 26 15:49:47 2012
New Revision: 1402553

URL: http://svn.apache.org/viewvc?rev=1402553&view=rev
Log:
Fix possible NaN issue in Euclidean distance, per http://stackoverflow.com/questions/13089214/nan-distances-in-mahout-euclidean-implementation

Modified:
    mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java

Modified: mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java
URL: http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java?rev=1402553&r1=1402552&r2=1402553&view=diff
==============================================================================
--- mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java (original)
+++ mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java Fri Oct 26 15:49:47 2012
@@ -46,7 +46,9 @@ public class EuclideanDistanceSimilarity
   @Override
   public double similarity(double dots, double normA, double normB, int numberOfColumns) {
-    double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
+    // Arg can't be negative in theory, but can in practice due to rounding, so cap it.
+    // Also note that normA / normB are actually the squares of the norms.
+    double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
     return 1.0 / (1.0 + euclideanDistance);
   }
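
To make the precision concern concrete, here is a minimal sketch that compares the expanded formula from the patch with a direct coordinate-wise difference. The class name EuclideanPrecisionDemo and both method names are hypothetical and are not part of Mahout; the test vectors are an assumption chosen only to put two points very close together.

```java
public class EuclideanPrecisionDemo {

    // Expanded form, mirroring the patched expression:
    // sqrt(max(0, normA - 2 * dots + normB)), where normA and normB are squared norms.
    public static double expandedDistance(double[] a, double[] b) {
        double dots = 0.0;
        double normA = 0.0; // squared norm of a
        double normB = 0.0; // squared norm of b
        for (int i = 0; i < a.length; i++) {
            dots += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // The three terms nearly cancel for close points, so rounding error dominates.
        return Math.sqrt(Math.max(0.0, normA - 2 * dots + normB));
    }

    // Direct form: subtract coordinates first, then square and sum.
    // The subtraction happens before any large intermediate values are formed.
    public static double directDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two points whose true distance is about 1e-9 (illustrative values).
        double[] a = {1.0, 1.0};
        double[] b = {1.0 + 1e-9, 1.0};
        System.out.println("expanded: " + expandedDistance(a, b));
        System.out.println("direct:   " + directDistance(a, b));
    }
}
```

With IEEE 754 doubles, the direct form should print a value close to 1e-9. The expanded form cannot resolve it, because the true squared distance of about 1e-18 is far below the rounding error of sums near 2. Whether that matters here depends on whether the job only has dots, normA and normB available at this point, in which case the direct form is not an option.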
