[jira] Created: (MAHOUT-466) simplify or alternative Similarity arithmetic(AbstractDistributedVectorSimilarity) for boolean data

Hui Wen Han (JIRA) Thu, 12 Aug 2010 05:42:54 -0700

simplify or alternative  Similarity 
arithmetic(AbstractDistributedVectorSimilarity) for boolean data
----------------------------------------------------------------------------------------------------


                 Key: MAHOUT-466
                 URL: https://issues.apache.org/jira/browse/MAHOUT-466
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Hui Wen Han
             Fix For: 0.4


For boolean data, the prefValue is always 1.0f, so the similarity arithmetic 
can be simplified.

for example:
1) DistributedEuclideanDistanceVectorSimilarity 

package org.apache.mahout.math.hadoop.similarity.vector;

import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

/**
 * distributed implementation of euclidean distance as vector similarity measure
 */
public class DistributedEuclideanDistanceVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
      double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {

    double n = 0.0;
    double sumXYdiff2 = 0.0;

    for (Cooccurrence cooccurrence : cooccurrences) {
      double diff = cooccurrence.getValueA() - cooccurrence.getValueB();
      sumXYdiff2 += diff * diff;
      n++;
    }

    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

}

For boolean data every diff is 0, so this one always returns n (the number of cooccurrences).
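
As a rough illustration, here is a minimal standalone sketch, not Mahout code. The class BooleanEuclideanSketch and its methods are hypothetical, and the co-occurring value pairs are modelled as two plain double arrays instead of Cooccurrence objects. It compares the general formula with the shortcut that should be enough when every value is 1.0.

```java
final class BooleanEuclideanSketch {

  // General formula, mirroring doComputeResult above.
  // valuesA[i] and valuesB[i] stand in for one co-occurrence (assumption).
  static double general(double[] valuesA, double[] valuesB) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (int i = 0; i < valuesA.length; i++) {
      double diff = valuesA[i] - valuesB[i];
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

  // Boolean shortcut: every diff is 0, so the denominator is 1.0
  // and the result is just the number of co-occurrences.
  static double booleanOnly(int numberOfCooccurrences) {
    return numberOfCooccurrences;
  }

  public static void main(String[] args) {
    double[] ones = {1.0, 1.0, 1.0};
    System.out.println(general(ones, ones)); // 3.0
    System.out.println(booleanOnly(3));      // 3.0
  }
}
```
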
2) DistributedUncenteredCosineVectorSimilarity 
/**
 * distributed implementation of cosine similarity that does not center its data
 */
public class DistributedUncenteredCosineVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
      double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {

    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;

    for (Cooccurrence cooccurrence : cooccurrences) {
      double x = cooccurrence.getValueA();
      double y = cooccurrence.getValueB();
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }

    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both vectors has -all- the same values;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;
  }

}

For boolean data sumXY, sumX2 and sumY2 all equal n, so this one will always return 1.0 (when there is at least one cooccurrence).
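
The same kind of check can be sketched for the cosine case. Again this is a standalone illustration, not Mahout code: BooleanCosineSketch and its methods are hypothetical names, and plain double arrays stand in for the Cooccurrence objects. Note that the general formula may differ from 1.0 by a floating-point rounding error, because Math.sqrt(n) * Math.sqrt(n) is not always exactly n.

```java
final class BooleanCosineSketch {

  // General formula, mirroring doComputeResult above.
  // valuesA[i] and valuesB[i] stand in for one co-occurrence (assumption).
  static double general(double[] valuesA, double[] valuesB) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (int i = 0; i < valuesA.length; i++) {
      double x = valuesA[i];
      double y = valuesB[i];
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      return Double.NaN;
    }
    return sumXY / denominator;
  }

  // Boolean shortcut: all three sums equal n, so the ratio is 1.0
  // whenever there is at least one co-occurrence.
  static double booleanOnly(int numberOfCooccurrences) {
    return numberOfCooccurrences == 0 ? Double.NaN : 1.0;
  }

  public static void main(String[] args) {
    double[] ones = {1.0, 1.0, 1.0};
    // Equal up to rounding error.
    System.out.println(Math.abs(general(ones, ones) - booleanOnly(3)) < 1e-12);
  }
}
```
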
3) DistributedUncenteredZeroAssumingCosineVectorSimilarity 
If n users like ItemA, m users like ItemB, and p users like both ItemA and ItemB,

DistributedUncenteredZeroAssumingCosineVectorSimilarity returns p/(m*n).

It can also be used for boolean data, but we can provide a simpler one that 
returns (p*p)/(m*n), which needs less computation.
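
To show how the proposed (p*p)/(m*n) could be computed from counts alone, here is a minimal sketch. It is not Mahout code: the class and method names are hypothetical. For comparison it also includes the textbook cosine of two 0/1 vectors with missing entries treated as zero, p / (sqrt(n) * sqrt(m)), which is an assumption about the definition and has not been checked against what the Mahout class actually returns. Under that definition the proposed value is the square of the cosine, so it should rank item pairs in the same order while avoiding the square roots.

```java
final class BooleanZeroAssumingCosineSketch {

  // Textbook cosine of two 0/1 vectors (assumed definition, see above).
  // n = users who like ItemA, m = users who like ItemB, p = users who like both.
  // Callers are assumed to pass n > 0 and m > 0.
  static double cosine(long n, long m, long p) {
    return p / (Math.sqrt(n) * Math.sqrt(m));
  }

  // Proposed simpler measure: (p*p)/(m*n), no square roots needed.
  // Casting to double first avoids integer overflow and integer division.
  static double squaredCosine(long n, long m, long p) {
    return ((double) p * p) / ((double) m * n);
  }

  public static void main(String[] args) {
    System.out.println(cosine(4, 9, 3));        // 0.5
    System.out.println(squaredCosine(4, 9, 3)); // 0.25
  }
}
```
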



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
