[ 
https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-914:
--------------------------------------

    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

Will be included in MAHOUT-910.
                
> Provide a non-distributed counterpart of the sampling which is applied in the 
> distributed item similarity computation
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-914
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-914
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-914.patch, downsampling.png
>
>
> The distributed item similarity computation applies a so-called 
> 'interaction-cut': it selectively downsamples 'power users' in 
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper. This is 
> done because the users with the most interactions usually dominate the 
> runtime without adding much to the quality: users with an enormous number 
> of interactions are very often crawlers or people sharing an account.
> Mahout should have an exact counterpart of this strategy for the 
> non-distributed code.
> I also attach a figure (downsampling.png) that shows experiments with this 
> strategy for the MovieLens 1M dataset. The dataset was split into a 90% 
> training set and a 10% test set. An interaction cut of size k was applied 
> and the prediction quality (using mean absolute error) was measured. The 
> prediction on the unsampled dataset corresponds to using k = 1000, as this 
> is the maximum number of interactions per user. We see that for k > 300 the 
> error seems to converge, and we get a quality that sufficiently replicates 
> the unsampled quality.
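
For readers unfamiliar with the idea, the interaction cut described above can be sketched in plain Java. This is a minimal illustration only, not the code from MAHOUT-914.patch or from ToItemVectorsMapper. The class name InteractionCut, the methods cap and apply, and the choice of uniform sampling without replacement via a shuffle are all assumptions made for this example; the actual Mahout implementation may sample differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of Mahout: caps the number of interactions per user.
final class InteractionCut {

  private InteractionCut() {}

  /**
   * Returns at most maxPerUser item IDs for one user. Users at or below the
   * limit are returned unchanged; 'power users' are reduced to a uniform
   * random sample (assumed strategy: shuffle, then keep the first maxPerUser).
   */
  static List<Long> cap(List<Long> itemIds, int maxPerUser, Random random) {
    if (maxPerUser < 0) {
      throw new IllegalArgumentException("maxPerUser must be non-negative");
    }
    if (itemIds.size() <= maxPerUser) {
      return new ArrayList<>(itemIds);
    }
    List<Long> shuffled = new ArrayList<>(itemIds);
    Collections.shuffle(shuffled, random);
    return new ArrayList<>(shuffled.subList(0, maxPerUser));
  }

  /** Applies the cut with size k to every user in a userId -> itemIds map. */
  static Map<Long, List<Long>> apply(Map<Long, List<Long>> itemsByUser, int k, long seed) {
    Random random = new Random(seed);
    Map<Long, List<Long>> result = new HashMap<>();
    for (Map.Entry<Long, List<Long>> entry : itemsByUser.entrySet()) {
      result.put(entry.getKey(), cap(entry.getValue(), k, random));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Long, List<Long>> itemsByUser = new HashMap<>();

    // A casual user with 3 interactions and a 'power user' with 1000.
    List<Long> casual = new ArrayList<>(List.of(10L, 20L, 30L));
    List<Long> power = new ArrayList<>();
    for (long item = 0; item < 1000; item++) {
      power.add(item);
    }
    itemsByUser.put(1L, casual);
    itemsByUser.put(2L, power);

    Map<Long, List<Long>> sampled = apply(itemsByUser, 300, 42L);
    System.out.println("casual user keeps " + sampled.get(1L).size() + " interactions");
    System.out.println("power user keeps " + sampled.get(2L).size() + " interactions");
  }
}
```

With k = 300, the value above which the reported error seems to converge, only users with more than 300 interactions are affected; everyone else passes through unchanged. The fixed seed is there to make runs repeatable and is likewise an assumption of this sketch.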

