[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584549#comment-14584549 ]
lariven edited comment on MAHOUT-1739 at 6/13/15 11:46 AM:
-----------------------------------------------------------
The existing unit test in the project can be used to reproduce this:
mvn test -Dtest=org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJobTest
How to reproduce the bug:
Step 1: at line 210, add two records to the test data:
writeLines(inputFile,
"1,1,1",
"1,4,1",//added
"2,4,1",//added
"1,3,1",
"2,2,1",
"2,3,1",
"3,1,1",
"3,2,1",
"4,1,1",
"4,2,1",
"4,3,1",
"5,2,1",
"6,1,1",
"6,2,1");
Step 2: at line 231, change maxSimilaritiesPerItem from 1 to 2:
TanimotoCoefficientSimilarity.class.getName(),
"--maxSimilaritiesPerItem", "2" });
Expected output:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.3333333333333333
3 1 0.4
3 4 0.6666666666666666
4 1 0.2
4 3 0.6666666666666666
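The similarity values above can be checked by hand. The following is a minimal standalone sketch, not Mahout code: the class and method names (TanimotoCheck, usersByItem, tanimoto) are made up for illustration, and it assumes the Tanimoto coefficient is computed over the sets of users who rated each item, ignoring the preference value.

```java
import java.util.*;

class TanimotoCheck {
    // Tanimoto coefficient: |A intersect B| / |A union B| over the users of two items.
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    // Builds item -> set of users from "user,item,pref" lines (pref is ignored).
    static Map<Integer, Set<Integer>> usersByItem(String... lines) {
        Map<Integer, Set<Integer>> result = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            int user = Integer.parseInt(parts[0]);
            int item = Integer.parseInt(parts[1]);
            result.computeIfAbsent(item, k -> new HashSet<>()).add(user);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same records as the modified test data in step 1.
        Map<Integer, Set<Integer>> items = usersByItem(
            "1,1,1", "1,4,1", "2,4,1", "1,3,1", "2,2,1", "2,3,1", "3,1,1",
            "3,2,1", "4,1,1", "4,2,1", "4,3,1", "5,2,1", "6,1,1", "6,2,1");
        // Print each unordered pair once.
        for (int a : items.keySet()) {
            for (int b : items.keySet()) {
                if (a < b) {
                    System.out.println(a + " " + b + " " + tanimoto(items.get(a), items.get(b)));
                }
            }
        }
    }
}
```

Taking the two highest values per item from these six pairs gives the eight expected rows: the pair 2/4 (1/6) is the only one that makes no item's top two.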
Actual output:
1 2 0.5
1 3 0.4
1 4 0.2
2 3 0.3333333333333333
3 4 0.6666666666666666
Why:
The cause is the odd swap of itemID with otherItemID. Because of it, some target
items can lose their similar items, and those similar items get appended to other target items.
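The effect of the swap can be imitated outside Hadoop. This is only a sketch under assumptions: SwapDemo and emit are hypothetical names, the top-2 lists are copied from the expected output above, and rows with the same (first, second) key are assumed to collapse into one output row, which is what the actual output suggests.

```java
import java.util.*;

class SwapDemo {
    // Emits (first, second) keys for every item's top-N list.
    // With orderPair = true the smaller ID is always written first, imitating the
    // if (itemID < otherItemID) branch quoted in the issue description.
    static Map<Integer, List<Integer>> emit(Map<Integer, int[]> topN, boolean orderPair) {
        Set<String> seen = new HashSet<>(); // assumption: duplicate keys collapse to one row
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<Integer, int[]> entry : topN.entrySet()) {
            for (int otherItemID : entry.getValue()) {
                int first = entry.getKey();
                int second = otherItemID;
                if (orderPair && first > second) {
                    int tmp = first;
                    first = second;
                    second = tmp;
                }
                if (seen.add(first + " " + second)) {
                    out.computeIfAbsent(first, k -> new ArrayList<>()).add(second);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Top-2 similar items per target item, taken from the expected output.
        Map<Integer, int[]> topN = new TreeMap<>();
        topN.put(1, new int[] {2, 3});
        topN.put(2, new int[] {1, 3});
        topN.put(3, new int[] {1, 4});
        topN.put(4, new int[] {1, 3});

        System.out.println("with swap:    " + emit(topN, true));
        System.out.println("without swap: " + emit(topN, false));
    }
}
```

With the swap, item 1 ends up with three similar items (2, 3, 4) although the limit is 2, item 4 has none, and items 2 and 3 keep only one each. Without the swap, every item keeps exactly its own two.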
> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> ------------------------------------------------------------------------
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.10.0
> Reporter: lariven
> Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> The number of similar items that ItemSimilarityJob outputs for a target item may
> exceed the value set for the maxSimilarItemsPerItem parameter. The following code
> in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID),
>       new DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID),
>       new DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why itemID needs to be switched with otherItemID, but I think a
> single line is enough:
> ctx.write(new EntityEntityWritable(itemID, otherItemID),
>     new DoubleWritable(similarItem.getSimilarity()));
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)