[
https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-430.
------------------------------
Assignee: Sean Owen
Resolution: Not A Problem
But once you fall off the end of one, the remaining data points in the other
array are irrelevant, since they have no counterpart in the one that ended.
These metrics are only computed over the item/users that are common to both.
You can write a test to double-check but after looking at it again I'm certain
it's correct.
It's probably the fastest implementation possible, and because of that, not the
shortest. I don't think it's too messy, but have a crack at improving it while
not comprising speed if you like. This is the best I could do after years of
looking at this method.
The duplication is unfortunate but hard to refactor because they're doing
slightly different things actually. Again open to suggestions but I have stared
at this for a long time and, given the priority for speed, not been able to do
better.
> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
> Key: MAHOUT-430
> URL: https://issues.apache.org/jira/browse/MAHOUT-430
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Emerson Murphy-HIll
> Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in
> AbstractSimilarity, both compute metrics over each User's/Tool's
> PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through
> each PreferenceArray in a single loop, comparing indexes to make sure we
> don't fall off the end. Eventually, we get to the end of an array, which is
> caught here:
> if (compare <= 0) {
> if (++xPrefIndex >= xLength) {
> break;
> }
> ...
> The problem is, the metrics may not be correct when the break occurs.
> Specifically, for the other array, the one that we *didn't* fall off the end
> of, the metrics don't reflect the preferences we have not yet visited. In the
> example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to
> do something like this:
> if (compare <= 0) {
> if (++xPrefIndex >= xLength) {
> sumY2 += squareSumRest(yPrefs,yPrefIndex);
> break;
> }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
> double squareSum = 0;
> for(int i = startingFrom; i < preferences.length; i++){
> double val = preferences[i].getValue();
> squareSum += val*val;
> }
> return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably
> sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.