[
https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Schelter updated MAHOUT-914:
--------------------------------------
Status: Patch Available (was: Open)
> Provide a non-distributed counterpart of the sampling which is applied in the
> distributed item similarity computation
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-914
> URL: https://issues.apache.org/jira/browse/MAHOUT-914
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.6
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Attachments: MAHOUT-914.patch, downsampling.png
>
>
> The distributed item similarity computation applies a so-called
> 'interaction-cut': it selectively downsamples 'power users' in
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper. This is
> done because the users with the most interactions usually dominate the
> runtime without contributing much to quality, as users with an
> enormous number of interactions are very often crawlers or people sharing an
> account.
> Mahout should have an exact counterpart of this strategy for the
> non-distributed code.
> I also attach a figure that shows experiments with this strategy on the
> MovieLens 1M dataset. The dataset was split into a 90% training set and a
> 10% test set. An interaction cut of size k was applied and the prediction
> quality (measured as mean absolute error) was recorded. Prediction on the
> unsampled dataset corresponds to using k = 1000, as this is the maximum
> number of interactions per user. We see that for k > 300 the error seems to
> converge and we get a quality that sufficiently replicates the unsampled
> quality.
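
For readers who want to see the idea in code, here is a minimal sketch of an
interaction cut of size k for a single user, written in plain Java. It is not
the code from MAHOUT-914.patch and not what ToItemVectorsMapper does
internally. The class name InteractionCut, the method sample, and the choice
of uniform sampling without replacement are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: caps one user's interactions at k.
final class InteractionCut {

  private InteractionCut() {}

  /**
   * Returns the user's item IDs unchanged if there are at most k of them,
   * otherwise k of them drawn uniformly at random without replacement.
   * The input list is never modified.
   */
  static List<Long> sample(List<Long> itemIds, int k, Random random) {
    if (k < 0) {
      throw new IllegalArgumentException("k must be non-negative");
    }
    if (itemIds.size() <= k) {
      // Ordinary users are left untouched; only 'power users' are cut.
      return new ArrayList<>(itemIds);
    }
    // Shuffle a copy and keep the first k entries (assumed sampling scheme).
    List<Long> copy = new ArrayList<>(itemIds);
    Collections.shuffle(copy, random);
    return new ArrayList<>(copy.subList(0, k));
  }

  public static void main(String[] args) {
    List<Long> powerUser = new ArrayList<>();
    for (long itemId = 0; itemId < 1000; itemId++) {
      powerUser.add(itemId);
    }
    List<Long> cut = sample(powerUser, 300, new Random(42L));
    System.out.println("kept " + cut.size() + " of " + powerUser.size());
  }
}
```

Passing the Random in explicitly keeps the sketch reproducible for
experiments such as the one described above. A non-distributed recommender
would apply such a cut once per user before computing item similarities.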