Hi,
I noticed the following behaviour, which seems a bit strange to me:
Let v=(v1, v2, ..., vn) and w=(w1, w2, ..., wm) be vectors that are used to
compute the similarity between two items/users. If all vi that overlap
with w (that is, vi!=0 and wi!=0) are equal, and all wj that overlap
with v are equal, neither the Euclidean nor the Pearson similarity can be
computed.
The attached test considers the following vectors: v=(0.2, 0.2, 0.4) and
w=(0.7, 0.7, 0). The overlapping components of v are all 0.2. The
overlapping components of w are all 0.7.
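To make the example concrete, here is a minimal standalone sketch (not Mahout code) that computes a plain Euclidean distance over the overlapping components of v and w. The class and method names are hypothetical, and the mapping 1 / (1 + distance) is an assumed, commonly used way to turn a distance into a similarity. It only shows that a finite value is well defined for this input.

```java
import java.util.Locale;

// Hypothetical illustration class; not part of Mahout.
class OverlapEuclideanSketch {

    /**
     * Similarity from the Euclidean distance over components where both
     * vectors are non-zero. Assumed mapping: 1 / (1 + distance).
     * Returns NaN when the vectors do not overlap at all.
     */
    static double similarity(double[] v, double[] w) {
        int overlap = 0;
        double sumSquaredDiff = 0.0;
        int n = Math.min(v.length, w.length);
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0 && w[i] != 0.0) { // overlap: vi != 0 and wi != 0
                double diff = v[i] - w[i];
                sumSquaredDiff += diff * diff;
                overlap++;
            }
        }
        if (overlap == 0) {
            return Double.NaN; // nothing to compare
        }
        return 1.0 / (1.0 + Math.sqrt(sumSquaredDiff));
    }

    public static void main(String[] args) {
        double[] v = {0.2, 0.2, 0.4};
        double[] w = {0.7, 0.7, 0.0};
        System.out.printf(Locale.ROOT, "similarity=%.4f%n", similarity(v, w));
    }
}
```

For the vectors above, the overlapping differences are both 0.5 in magnitude, so the distance is sqrt(0.5) and the sketch yields a finite similarity rather than NaN.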
The problem is that "double computeResult(int n, double sumXY, double
sumX2, double sumY2, double sumXYdiff2)" in the corresponding subclass
of AbstractSimilarity is called with sumXY=sumX2=sumY2=0 and therefore
returns Double.NaN. This contradicts the behaviour described in the book
"Mahout in Action", p. 49. The last complete sentence there is: "Note
that we were able compute some notion of similarity for all pairs of
users here, whereas the Pearson correlation couldn't produce an answer
for users 1 and 3." Because of the problem described above, the
Euclidean similarity can't produce an answer either. The book's case of
users 1 and 3 is a special case of this problem, in which there is only
one overlap.
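The zero sums can be reproduced outside Mahout. The following sketch assumes that the sums passed to computeResult are taken over mean-centered overlapping components; that is my reading of why sumXY, sumX2 and sumY2 are all 0, not something verified against the AbstractSimilarity source. The class and method names are hypothetical, and the final ratio is a generic Pearson-style formula used only for illustration.

```java
// Hypothetical illustration class; not part of Mahout.
class CenteredSumsSketch {

    /** Pearson-style ratio computed on mean-centered overlapping components. */
    static double centeredCorrelation(double[] v, double[] w) {
        int n = Math.min(v.length, w.length);
        int count = 0;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0 && w[i] != 0.0) { // overlap: vi != 0 and wi != 0
                sumX += v[i];
                sumY += w[i];
                count++;
            }
        }
        if (count == 0) {
            return Double.NaN;
        }
        double meanX = sumX / count;
        double meanY = sumY / count;

        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0 && w[i] != 0.0) {
                double x = v[i] - meanX; // 0 when all overlapping vi are equal
                double y = w[i] - meanY; // 0 when all overlapping wi are equal
                sumXY += x * y;
                sumX2 += x * x;
                sumY2 += y * y;
            }
        }
        // With all three sums equal to 0 this evaluates 0.0 / 0.0, which is NaN.
        return sumXY / Math.sqrt(sumX2 * sumY2);
    }

    public static void main(String[] args) {
        double[] v = {0.2, 0.2, 0.4};
        double[] w = {0.7, 0.7, 0.0};
        System.out.println(centeredCorrelation(v, w));
    }
}
```

Under this assumption, a single overlapping component behaves the same way: one value minus its own mean is 0, so the sums vanish as well.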
Regards,
Mattias
--
--------------------------------
Mattias Hilliges
--------------------------------
Attachment: abstractSimilarity.csv (columns: user ID, item ID, preference value)
1,1,0.2
1,2,0.7
2,1,0.2
2,2,0.7
3,1,0.4
Attachment: TestAbstractSimilarity.java
/*
* (c) neofonie Technologieentwicklung und Informationsmanagement GmbH
*
* This computer program is the sole property of neofonie GmbH
* (http://www.neofonie.de) and is protected under the German Copyright Act
* (paragraph 69a UrhG). All rights are reserved. Making copies,
* duplicating, modifying, using or distributing this computer program
* in any form, without prior written consent of neofonie, is
* prohibited. Violation of copyright is punishable under the
* German Copyright Act (paragraph 106 UrhG). Removing this copyright
* statement is also a violation.
*/
package de.neofonie.recommendation.system.connectors.businesslogic;
import static org.junit.Assert.assertEquals;
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.junit.Test;
/**
 * @author hilli...@neofonie.de
 */
public class TestAbstractSimilarity {

    /**
     * This test documents a problem with Mahout: Let v=(v1, v2, ..., vn) and w=(w1, w2, ..., wm) be
     * vectors that are used to compute the similarity between two items/users. If all vi that overlap
     * with w (that is, vi!=0 and wi!=0) are equal, and all wj that overlap with v are equal, no
     * similarity can be computed.<br>
     * This test considers the vectors v=(0.2, 0.2, 0.4) and w=(0.7, 0.7, 0). The overlapping
     * components of v are all 0.2. The overlapping components of w are all 0.7.
     */
    @Test
    public void testComponentsEqual() throws Exception {
        // Each CSV line is a userID,itemID,preference triple.
        DataModel model = new FileDataModel(new File("src/test/resources/abstractSimilarity.csv"));
        ItemSimilarity similarity = new EuclideanDistanceSimilarity(model);
        GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);

        // Ask for the single item most similar to item 1; item 2 is the only candidate.
        List<RecommendedItem> recommendations = recommender.mostSimilarItems(1, 1);
        assertEquals(1, recommendations.size());
        RecommendedItem firstRecommendation = recommendations.get(0);
        assertEquals(2L, firstRecommendation.getItemID());
    }
}