Simplify or provide alternative similarity arithmetic (AbstractDistributedVectorSimilarity) for boolean data
----------------------------------------------------------------------------------------------------
Key: MAHOUT-466
URL: https://issues.apache.org/jira/browse/MAHOUT-466
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.4
Reporter: Hui Wen Han
Fix For: 0.4
For boolean data the prefValue is always 1.0f, so the similarity arithmetic can be simplified.
For example:
1) DistributedEuclideanDistanceVectorSimilarity
package org.apache.mahout.math.hadoop.similarity.vector;

import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

/**
 * distributed implementation of euclidean distance as vector similarity measure
 */
public class DistributedEuclideanDistanceVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
      double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (Cooccurrence cooccurrence : cooccurrences) {
      double diff = cooccurrence.getValueA() - cooccurrence.getValueB();
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }
}
For boolean data every diff is 0, so this one always returns n (the number of cooccurrences).
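The collapse to n can be checked with a small standalone sketch. The Pair class below is a hypothetical stand-in for Mahout's Cooccurrence (only the two values are modelled), and booleanOnly is an illustrative name, not an existing Mahout method.

```java
import java.util.ArrayList;
import java.util.List;

class BooleanEuclideanSketch {

  // Hypothetical stand-in for Cooccurrence: just the two preference values.
  static final class Pair {
    final double a;
    final double b;

    Pair(double a, double b) {
      this.a = a;
      this.b = b;
    }
  }

  // Same arithmetic as doComputeResult in the class quoted above.
  static double general(List<Pair> cooccurrences) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (Pair c : cooccurrences) {
      double diff = c.a - c.b;
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }

  // Boolean shortcut: every diff is 0, so only the count remains.
  static double booleanOnly(List<Pair> cooccurrences) {
    return cooccurrences.size();
  }

  public static void main(String[] args) {
    List<Pair> pairs = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      pairs.add(new Pair(1.0, 1.0)); // boolean data: both values are 1.0
    }
    System.out.println(general(pairs) + " vs " + booleanOnly(pairs));
  }
}
```

With all values equal to 1.0 both methods agree, so a boolean-only implementation would not need to read the values at all.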
2) DistributedUncenteredCosineVectorSimilarity
/**
 * distributed implementation of cosine similarity that does not center its data
 */
public class DistributedUncenteredCosineVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
      double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (Cooccurrence cooccurrence : cooccurrences) {
      double x = cooccurrence.getValueA();
      double y = cooccurrence.getValueB();
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both vectors has -all- the same values;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;
  }
}
For boolean data sumXY, sumX2 and sumY2 all equal n, so this one always returns 1.0 (whenever n > 0).
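A minimal sketch of the same observation, using plain arrays in place of Cooccurrence objects (an assumption made to keep the example standalone; booleanOnly is an illustrative name). Note that Math.sqrt(n) * Math.sqrt(n) is not always exactly n in floating point, so the general formula yields 1.0 only up to rounding, which is one more reason to return the constant directly.

```java
class BooleanCosineSketch {

  // Same arithmetic as doComputeResult above; xs[i] and ys[i] stand in for
  // getValueA() and getValueB() of the i-th cooccurrence.
  static double general(double[] xs, double[] ys) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (int i = 0; i < xs.length; i++) {
      sumXY += xs[i] * ys[i];
      sumX2 += xs[i] * xs[i];
      sumY2 += ys[i] * ys[i];
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      return Double.NaN;
    }
    return sumXY / denominator;
  }

  // Boolean shortcut: with all values 1.0 the ratio is n / n.
  static double booleanOnly(int n) {
    return n == 0 ? Double.NaN : 1.0;
  }

  public static void main(String[] args) {
    double[] ones = {1.0, 1.0, 1.0};
    System.out.println(general(ones, ones));       // 1.0 up to rounding
    System.out.println(booleanOnly(ones.length));
  }
}
```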
3) DistributedUncenteredZeroAssumingCosineVectorSimilarity
If n users like ItemA, m users like ItemB, and p users like both ItemA and ItemB,
DistributedUncenteredZeroAssumingCosineVectorSimilarity returns p/sqrt(m*n).
It can also be used for boolean data, but we could provide a simpler one that returns
(p*p)/(m*n), which needs less computation (no square root) and preserves the ordering.
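The two variants can be compared in a short sketch. It reads the zero-assuming cosine of two boolean vectors as p / sqrt(n*m), which follows from the standard cosine definition and is an assumption here rather than something checked against the Mahout source. The class and method names are illustrative only.

```java
class BooleanZeroAssumingCosineSketch {

  // Cosine of two boolean vectors: p shared users, n and m users in total.
  static double cosine(long p, long n, long m) {
    return p / Math.sqrt((double) n * m);
  }

  // Proposed cheaper variant: the squared cosine, no square root needed.
  static double squaredCosine(long p, long n, long m) {
    return ((double) p * p) / ((double) n * m);
  }

  public static void main(String[] args) {
    // 4 users like ItemA, 9 like ItemB, 3 like both.
    System.out.println(cosine(3, 4, 9));
    System.out.println(squaredCosine(3, 4, 9));
  }
}
```

Because both values are non-negative, squaring keeps the relative order of item pairs, so rankings built on the squared value match those built on the cosine; only the absolute scores differ.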