[ https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175838#comment-13175838 ]
Sean Owen commented on MAHOUT-906:
----------------------------------

After looking at this more, I'm not sure this is the right refactoring. The time-based implementation is parameterized by four times. Shouldn't it be one: everything before that time is training data, everything after it is test data? It also still splits on a relevance threshold - shouldn't it be based purely on time?

Splitting this logic out, so that the relevance-threshold business exists in only one implementation, will take more surgery than this. For example, evaluate() should no longer take a relevance-threshold parameter, since that would be a property of one particular splitter strategy, specified in its constructor (a sketch of this shape follows below the quoted description). The resulting time-based implementation would then not be so much copy and paste, which is good; the duplication is a symptom of the logic not being split up completely. I think that would make the change worth committing.

While I can make these changes, and have done so locally, you'll want to watch the spacing and formatting, and make sure the copyright headers are in place. Also, the implementations should live in .impl.eval, not .eval.

Better still would be to refactor the test/training data split too, not just the relevant items. Consider that a bonus.

> Allow collaborative filtering evaluators to use custom logic in splitting data set
> -----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>         Attachments: MAHOUT-906.patch, MAHOUT-906.patch, MAHOUT-906.patch, MAHOUT-906.patch, MAHOUT-906.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used to split the data set into training and testing sets. Here is how things stand: there are two independent evaluator classes. AbstractDifferenceRecommenderEvaluator splits all the preferences randomly into a training set and a testing set. GenericRecommenderIRStatsEvaluator takes one user at a time, removes their top AT preferences, and counts how many of them the system recommends back.
>
> I have two use cases, both dealing with temporal dynamics. In one, there may be expired items that can be used for building a training model but not a test model. In the other, I may want to simulate the behavior of a real system by building a preference matrix on days 1 through k and testing on the ratings the user generated on day k+1. In this case it is not items but preferences (user, item, rating triplets) that may belong only to the training set. Before we discuss an appropriate design, are there any other use cases we need to keep in mind?
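For concreteness, here is a minimal sketch of the splitter-strategy shape discussed in the comment above. All class and method names here are hypothetical and illustrative, not Mahout's actual API; the point is only that the relevance threshold becomes a constructor property of one strategy, while a time-based strategy needs just a single cutoff.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; names are illustrative, not Mahout's actual API.
public final class SplitterSketch {

  /** Stand-in for a (user, item, value) preference with a timestamp. */
  static final class Pref {
    final long userID;
    final long itemID;
    final float value;
    final long timestamp;

    Pref(long userID, long itemID, float value, long timestamp) {
      this.userID = userID;
      this.itemID = itemID;
      this.value = value;
      this.timestamp = timestamp;
    }
  }

  /** Strategy that decides which of a user's preferences are held out as "relevant". */
  interface RelevantItemsSplitter {
    List<Pref> selectRelevant(List<Pref> userPrefs);
  }

  /** Threshold-based relevance: the threshold moves from evaluate() into the constructor. */
  static final class ThresholdSplitter implements RelevantItemsSplitter {
    private final double relevanceThreshold;

    ThresholdSplitter(double relevanceThreshold) {
      this.relevanceThreshold = relevanceThreshold;
    }

    @Override
    public List<Pref> selectRelevant(List<Pref> userPrefs) {
      List<Pref> relevant = new ArrayList<Pref>();
      for (Pref p : userPrefs) {
        if (p.value >= relevanceThreshold) {
          relevant.add(p);   // high-rated items are held out for testing
        }
      }
      return relevant;
    }
  }

  /** Time-based relevance: one cutoff; preferences after it are held out. */
  static final class TimeBasedSplitter implements RelevantItemsSplitter {
    private final long cutoffTimestamp;

    TimeBasedSplitter(long cutoffTimestamp) {
      this.cutoffTimestamp = cutoffTimestamp;
    }

    @Override
    public List<Pref> selectRelevant(List<Pref> userPrefs) {
      List<Pref> relevant = new ArrayList<Pref>();
      for (Pref p : userPrefs) {
        if (p.timestamp > cutoffTimestamp) {
          relevant.add(p);   // after the cutoff: test data
        }
      }
      return relevant;
    }
  }
}
{code}

With this shape, evaluate() takes a RelevantItemsSplitter rather than a relevance threshold, and each strategy carries its own configuration.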
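The second use case from the description, training on days 1 through k and testing on day k+1, falls out of the same idea applied to the test/training split rather than to relevant items. A rough sketch, reusing the hypothetical Pref stand-in from the block above:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only, reusing the Pref stand-in from the previous block.
final class TemporalDataSplit {

  final List<SplitterSketch.Pref> training = new ArrayList<SplitterSketch.Pref>();
  final List<SplitterSketch.Pref> test = new ArrayList<SplitterSketch.Pref>();

  /**
   * Preferences with timestamps at or before the cutoff build the training
   * model; everything later is held out for testing. Note it is whole
   * preferences (user, item, rating triplets), not items, that are assigned
   * to one side or the other.
   */
  TemporalDataSplit(List<SplitterSketch.Pref> allPrefs, long cutoffTimestamp) {
    for (SplitterSketch.Pref p : allPrefs) {
      if (p.timestamp <= cutoffTimestamp) {
        training.add(p);
      } else {
        test.add(p);
      }
    }
  }
}
{code}

The expired-items case would be a third strategy with the opposite restriction: such items may appear on the training side but are filtered out of the test side.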