[ 
https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164484#comment-13164484
 ] 

Hudson commented on MAHOUT-914:
-------------------------------

Integrated in Mahout-Quality #1234 (See 
[https://builds.apache.org/job/Mahout-Quality/1234/])
    MAHOUT-910 merge ideas from MAHOUT-914, better docs, new no-limit arg, 
different defaults from Sebastian

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java

                
> Provide a non-distributed counterpart of the sampling which is applied in the 
> distributed item similarity computation
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-914
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-914
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-914.patch, downsampling.png
>
>
> The distributed item similarity computation applies a so-called 
> 'interaction-cut': it selectively downsamples 'power users' in 
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper. This is 
> done because the users with the most interactions usually dominate the 
> runtime without contributing much to the quality, as users with an 
> enormous number of interactions are very often crawlers or people sharing 
> an account.
> Mahout should have an exact counterpart of this strategy for the 
> non-distributed code.
> I also attach a figure that shows experiments with this strategy on the 
> MovieLens 1M dataset. The dataset was split into a 90% training set and a 
> 10% test set. An interaction cut of size k was applied and the prediction 
> quality (measured as mean absolute error) was recorded. The unsampled 
> dataset corresponds to k = 1000, as this is the maximum number of 
> interactions per user. For k > 300 the error appears to converge, and the 
> quality closely matches that of the unsampled dataset.
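
For illustration, the interaction cut described above can be sketched in a
few lines of plain Java. This is a minimal sketch, not the code from
ToItemVectorsMapper or SamplingCandidateItemsStrategy: the class name
InteractionCut, the method cut and the parameter maxInteractions are
hypothetical, and uniform sampling without replacement (shuffle, then keep
the first k) is an assumption about how the sample is drawn. Users at or
below the limit keep all of their interactions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: caps one user's interactions at k.
final class InteractionCut {

    private InteractionCut() {
    }

    /**
     * Returns at most maxInteractions of the given item IDs.
     * Users at or below the limit are returned unchanged (as a copy);
     * 'power users' above it are downsampled uniformly at random.
     */
    public static List<Long> cut(List<Long> itemIds, int maxInteractions, Random random) {
        if (maxInteractions < 0) {
            throw new IllegalArgumentException("maxInteractions must be >= 0");
        }
        if (itemIds.size() <= maxInteractions) {
            return new ArrayList<>(itemIds);
        }
        // Assumption: a uniform sample without replacement, drawn by
        // shuffling a copy and keeping the first maxInteractions entries.
        List<Long> shuffled = new ArrayList<>(itemIds);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, maxInteractions));
    }

    public static void main(String[] args) {
        // A made-up power user with 1000 interactions, cut to k = 300.
        List<Long> powerUser = new ArrayList<>();
        for (long itemId = 0; itemId < 1000; itemId++) {
            powerUser.add(itemId);
        }
        List<Long> sampled = cut(powerUser, 300, new Random(42L));
        System.out.println("kept " + sampled.size() + " of " + powerUser.size());
    }
}
```

Passing the Random in explicitly keeps the sample reproducible, which
matters when comparing prediction quality across different values of k.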

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
