[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584549#comment-14584549
 ] 

lariven edited comment on MAHOUT-1739 at 6/13/15 11:46 AM:
-----------------------------------------------------------

The existing unit test in the project can be used to reproduce the problem:
mvn test -Dtest=org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJobTest

How to reproduce the bug:
 Step 1: at line 210, add two records to the test data:

    writeLines(inputFile,
        "1,1,1",
            "1,4,1",//added
            "2,4,1",//added
        "1,3,1",
        "2,2,1",
        "2,3,1",
        "3,1,1",
        "3,2,1",
        "4,1,1",
        "4,2,1",
        "4,3,1",
        "5,2,1",
        "6,1,1",
        "6,2,1");

 Step 2: at line 231, change maxSimilaritiesPerItem from 1 to 2:

        TanimotoCoefficientSimilarity.class.getName(),
        "--maxSimilaritiesPerItem", "2" });

Expected output:
1       2       0.5
1       3       0.4
2       1       0.5
2       3       0.3333333333333333
3       1       0.4
3       4       0.6666666666666666
4       1       0.2
4       3       0.6666666666666666


Actual output:
1       2       0.5
1       3       0.4
1       4       0.2
2       3       0.3333333333333333
3       4       0.6666666666666666


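Why:

The mapper swaps itemID with otherItemID whenever itemID is the larger of the
two. As a result, some target items lose part of their similar items, and those
similarities are attributed to other target items instead, which can then
exceed the limit.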
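
The effect can be reproduced outside Hadoop with a small in-memory sketch. It is not Mahout's code: the class name TopNSwapDemo and the method emit are made up for illustration, and it assumes that the job keeps the top N similar items per item and that identical pairs are written only once, which is consistent with the output shown above. It computes the Tanimoto coefficient from the modified test data and emits the pairs with and without the swap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-alone demo, not part of Mahout.
class TopNSwapDemo {

    // (userID, itemID) pairs from the modified test data; every preference value is 1.
    static final long[][] PREFS = {
        {1, 1}, {1, 4}, {2, 4}, {1, 3}, {2, 2}, {2, 3}, {3, 1},
        {3, 2}, {4, 1}, {4, 2}, {4, 3}, {5, 2}, {6, 1}, {6, 2}
    };

    // Tanimoto coefficient of two user sets: |A intersect B| / |A union B|.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> common = new TreeSet<>(a);
        common.retainAll(b);
        return (double) common.size() / (a.size() + b.size() - common.size());
    }

    // Keeps the topN most similar items for each item, then emits the pairs.
    // swap = true writes every pair as (smaller ID, larger ID), like the if/else
    // quoted in the issue; swap = false always writes (itemID, otherItemID).
    static Set<String> emit(int topN, boolean swap) {
        Map<Long, Set<Long>> usersByItem = new TreeMap<>();
        for (long[] p : PREFS) {
            usersByItem.computeIfAbsent(p[1], k -> new TreeSet<>()).add(p[0]);
        }
        Set<String> pairs = new TreeSet<>(); // a set, so identical pairs collapse
        for (Long itemID : usersByItem.keySet()) {
            Set<Long> users = usersByItem.get(itemID);
            List<Long> others = new ArrayList<>(usersByItem.keySet());
            others.remove(itemID);
            // Sort the other items by descending similarity to itemID.
            others.sort(Comparator.comparingDouble(
                (Long other) -> -tanimoto(users, usersByItem.get(other))));
            for (Long otherItemID : others.subList(0, Math.min(topN, others.size()))) {
                if (swap && otherItemID < itemID) {
                    pairs.add(otherItemID + "-" + itemID);
                } else {
                    pairs.add(itemID + "-" + otherItemID);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println("with swap:    " + emit(2, true));
        System.out.println("without swap: " + emit(2, false));
    }
}
```

With the swap, item 1 ends up with three entries (1-2, 1-3, 1-4) and item 4 with none, as in the actual output above. Without the swap, every item has exactly two entries, as in the expected output.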



> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-1739
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1739
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.10.0
>            Reporter: lariven
>              Labels: easyfix, patch
>         Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> The number of similar items that ItemSimilarityJob outputs for each target
> item may exceed the value set for the maxSimilarItemsPerItem parameter. The
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
>         if (itemID < otherItemID) {
>           ctx.write(new EntityEntityWritable(itemID, otherItemID),
>               new DoubleWritable(similarItem.getSimilarity()));
>         } else {
>           ctx.write(new EntityEntityWritable(otherItemID, itemID),
>               new DoubleWritable(similarItem.getSimilarity()));
>         }
> I don't know why itemID needs to be switched with otherItemID; I think a
> single statement is enough:
>           ctx.write(new EntityEntityWritable(itemID, otherItemID),
>               new DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
