[
https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164484#comment-13164484
]
Hudson commented on MAHOUT-914:
-------------------------------
Integrated in Mahout-Quality #1234 (See
[https://builds.apache.org/job/Mahout-Quality/1234/])
MAHOUT-910 merge ideas from MAHOUT-914, better docs, new no-limit arg,
different defaults from Sebastian
srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java
> Provide a non-distributed counterpart of the sampling which is applied in the
> distributed item similarity computation
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-914
> URL: https://issues.apache.org/jira/browse/MAHOUT-914
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.6
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Attachments: MAHOUT-914.patch, downsampling.png
>
>
> The distributed item similarity computation applies a so-called
> 'interaction-cut': it selectively downsamples 'power users' in
> org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper. This is
> done because the users with the most interactions usually dominate the
> runtime without contributing much to quality, since users with an
> enormous number of interactions are very often crawlers or people sharing
> an account.
> Mahout should have an exact counterpart of this strategy for the
> non-distributed code.
> I also attach a figure that shows experiments with this strategy on the
> MovieLens 1M dataset. The dataset was split into a 90% training set and a
> 10% test set. An interaction cut of size k was applied and the prediction
> quality (using mean absolute error) was measured. The prediction on the
> unsampled dataset corresponds to using k = 1000, as this is the maximum
> number of interactions per user. We see that with k > 300 the error seems
> to converge, and we get a quality that sufficiently replicates the
> unsampled quality.
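
For illustration, the interaction cut described above might be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not the code in ToItemVectorsMapper or SamplingCandidateItemsStrategy: the class name InteractionCut, the method apply and the parameter maxPrefsPerUser are hypothetical, and uniform sampling without replacement is assumed as the way to pick which interactions of a power user survive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only: the names here are hypothetical and not part of the Mahout API.
final class InteractionCut {

  private InteractionCut() {}

  /**
   * Returns a copy of itemsByUser in which no user keeps more than
   * maxPrefsPerUser interactions. Users at or below the limit are copied
   * unchanged; users above it keep a uniform random subset (an assumption,
   * the real mapper may sample differently).
   */
  static Map<Long, List<Long>> apply(Map<Long, List<Long>> itemsByUser,
                                     int maxPrefsPerUser,
                                     Random random) {
    if (maxPrefsPerUser < 1) {
      throw new IllegalArgumentException("maxPrefsPerUser must be at least 1");
    }
    Map<Long, List<Long>> result = new HashMap<>();
    for (Map.Entry<Long, List<Long>> entry : itemsByUser.entrySet()) {
      // Work on a copy so the caller's lists are never modified.
      List<Long> items = new ArrayList<>(entry.getValue());
      if (items.size() > maxPrefsPerUser) {
        // Shuffle, then keep the first k entries: a uniform sample without replacement.
        Collections.shuffle(items, random);
        items = new ArrayList<>(items.subList(0, maxPrefsPerUser));
      }
      result.put(entry.getKey(), items);
    }
    return result;
  }
}
```

With this sketch, calling InteractionCut.apply(data, 300, new Random()) would correspond to the k = 300 setting from the experiment, and passing a seeded Random makes a run repeatable.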