[
https://issues.apache.org/jira/browse/MAHOUT-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884726#action_12884726
]
Sebastian Schelter commented on MAHOUT-420:
-------------------------------------------
Hi Sean,
I think you're right, optimization is crucial here and there's still a lot of
work to do. I will have a look at the things you pointed out. Meanwhile I will
try to answer your questions and explain how the job works as best I can. I
hope the explanation is clear enough for you to join in on this.
{quote}
I'm not fully understanding is handling of NaN. You see what was done before -
NaN values in vectors were used to exclude items from recommendation. It's a
reasonably nice way to do it. What's the equivalent here? I see other bits of
code paying attention to NaN.
{quote}
From the example output attached one can see that there is always a NaN summand
mapped for the prediction computation towards items a user already knows, so
those will be excluded as well, because the predicted preference comes out as
NaN. Unfortunately this happens only in the last computation stage; I didn't
find a way to do it earlier (maybe you see one?).
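To illustrate the mechanism outside of Hadoop, here is a minimal stand-alone sketch in plain Java (the class and method names are made up, this is not the actual job code): once a single NaN summand enters the numerator or denominator, the final quotient is NaN, so the already-known item drops out of the result.

```java
public class NaNExclusion {

    /**
     * Weighted-average prediction. Any NaN summand poisons the quotient,
     * which is how already-known items get filtered out downstream.
     */
    static double predict(double[] similarities, double[] preferences) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            numerator += similarities[i] * preferences[i];
            denominator += Math.abs(similarities[i]);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // known item: a NaN "self-similarity" summand is mapped along
        double known = predict(new double[] {Double.NaN, 0.25},
                               new double[] {5.0, 5.0});
        // unknown item: all summands are real numbers
        double unknown = predict(new double[] {0.5, 0.25, 0.25},
                                 new double[] {5.0, 5.0, 2.0});
        System.out.println(known);   // NaN -> excluded from recommendation
        System.out.println(unknown); // 4.25
    }
}
```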
{quote}
Are we handling "boolean" preferences efficiently? Before it would avoid the
vector-times-preference step when the pref was known to be 1.0, and I don't see
that now.
{quote}
I don't think that the current patch works well for boolean preferences. I will
need some time to investigate that (or maybe you could give me some hints).
{quote}
Finally there is a feature in vectors that will save space, causing it to write
float values instead of doubles, since we don't need 64 bits of precision. I
also don't see how that's preserved.
{quote}
I think changing the doubles to floats in PredictionPartWritable will have the
same effect as the float values in the vectors you mentioned; that will be on
my todo list for an updated patch.
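For illustration, a small sketch of the space saving using only java.io (the class is a made-up stand-in, not a Hadoop Writable): DataOutput.writeFloat emits 4 bytes per value where writeDouble emits 8, which is what swapping the field types in the Writable's write()/readFields() would buy us.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FloatVsDouble {

    /** Serialize n similarity values and return the number of bytes written. */
    static int serializedSize(int n, boolean asFloat) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                if (asFloat) {
                    out.writeFloat(0.25f); // 4 bytes, enough precision here
                } else {
                    out.writeDouble(0.25); // 8 bytes
                }
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // cannot happen in-memory
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println(serializedSize(1000, false)); // 8000
        System.out.println(serializedSize(1000, true));  // 4000
    }
}
```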
{quote}
Basically I am not yet sure how the new computation is structured from reading
the code. I think some comments on the "Aggregate" jobs would be ideal.
{quote}
I attached the computations of the unit-test example (with the combiner
disabled for clarity); I hope the output provides a clear picture of how the
computation is done. Please note that this is an unrealistic example, as every
item is similar to every other item.
A short summary of the changes:
The cooccurrence matrix has been replaced with the similarity matrix, and the
PartialMultiplyMapper and the AggregateAndRecommendReducer have partly changed.
The PartialMultiplyMapper receives the preferences (userIDs and prefValues) as
well as the column from the similarity matrix for an item.
For each preference and similar item it now maps a summand of the numerator and
a summand of the denominator (wrapped in a PredictionPartWritable) of the
formula for the prediction of the preference of the user towards the similar
item:
i = the current item the PartialMultiplyMapper is looking at
u = a user preferring that item
j = an item similar to i (known from the similarity matrix column)
Prediction(u,j) = (... + preference(u,i) * similarity(i,j) + ...) /
                  (... + |similarity(i,j)| + ...)
The AggregateAndRecommendReducer receives all PredictionPartWritables for a
user (secondary sorted by item) and can thus compute all the predictions for
the user.
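To make the data flow concrete, here is a minimal stand-alone sketch (simplified stand-ins for the actual Writables; the class and method names are made up) of what one summand pair carries and how the reducer aggregates the parts for a single (user, item):

```java
import java.util.ArrayList;
import java.util.List;

public class PredictionSketch {

    /** Simplified stand-in for PredictionPartWritable: one summand pair. */
    static class PredictionPart {
        final double numeratorPart;   // preference(u,i) * similarity(i,j)
        final double denominatorPart; // |similarity(i,j)|
        PredictionPart(double pref, double sim) {
            this.numeratorPart = pref * sim;
            this.denominatorPart = Math.abs(sim);
        }
    }

    /** What the reducer does per (user, item): sum the parts and divide. */
    static double aggregate(List<PredictionPart> parts) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (PredictionPart p : parts) {
            numerator += p.numeratorPart;
            denominator += p.denominatorPart;
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Prediction(dog, icecream) from the unit-test example: dog prefers
        // burger (5), hotdog (5) and berries (2); their similarities to
        // icecream are 0.5, 0.25 and 0.25.
        List<PredictionPart> parts = new ArrayList<>();
        parts.add(new PredictionPart(5.0, 0.5));
        parts.add(new PredictionPart(5.0, 0.25));
        parts.add(new PredictionPart(2.0, 0.25));
        System.out.println(aggregate(parts)); // 4.25
    }
}
```

The result matches the value the AggregateAndRecommendReducer computes for user [1] and item [4] in the attached trace.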
-----------------------------------------
{noformat}
user-item-matrix

          burger  hotdog  berries  icecream
dog          5       5       2        -
rabbit       2       -       3        5
cow          -       5       -        3
donkey       3       -       -        5

item-item-similarity-matrix
(tanimoto coefficient of the item vectors of the user-item-matrix)

          burger  hotdog  berries  icecream
burger      -      0.25    0.66     0.5
hotdog     0.25     -      0.33     0.25
berries    0.66    0.33     -       0.25
icecream   0.5     0.25    0.25      -
Prediction(dog, icecream)   = (0.5 * 5 + 0.25 * 5 + 0.25 * 2) / (0.5 + 0.25 + 0.25)   ~ 4.3
Prediction(rabbit, hotdog)  = (0.25 * 2 + 0.33 * 3 + 0.25 * 5) / (0.25 + 0.33 + 0.25) ~ 3.3
Prediction(cow, burger)     = (0.25 * 5 + 0.5 * 3) / (0.25 + 0.5)                     ~ 3.7
Prediction(cow, berries)    = (0.33 * 5 + 0.25 * 3) / (0.33 + 0.25)                   ~ 4.1
Prediction(donkey, hotdog)  = (0.25 * 3 + 0.25 * 5) / (0.25 + 0.25)                   ~ 4.0
Prediction(donkey, berries) = (0.66 * 3 + 0.25 * 5) / (0.66 + 0.25)                   ~ 3.5
-----------------------------------------
PartialMultiplyMapper: looking at item [1]
looking at user [1] with preference [5.0]
similar item [1] with similarity [NaN]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.6666667]
similar item [4] with similarity [0.5]
looking at user [2] with preference [2.0]
similar item [1] with similarity [NaN]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.6666667]
similar item [4] with similarity [0.5]
looking at user [4] with preference [3.0]
similar item [1] with similarity [NaN]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.6666667]
similar item [4] with similarity [0.5]
PartialMultiplyMapper: looking at item [2]
looking at user [1] with preference [5.0]
similar item [1] with similarity [0.25]
similar item [2] with similarity [NaN]
similar item [3] with similarity [0.33333334]
similar item [4] with similarity [0.25]
looking at user [3] with preference [5.0]
similar item [1] with similarity [0.25]
similar item [2] with similarity [NaN]
similar item [3] with similarity [0.33333334]
similar item [4] with similarity [0.25]
PartialMultiplyMapper: looking at item [3]
looking at user [1] with preference [2.0]
similar item [1] with similarity [0.6666667]
similar item [2] with similarity [0.33333334]
similar item [3] with similarity [NaN]
similar item [4] with similarity [0.25]
looking at user [2] with preference [3.0]
similar item [1] with similarity [0.6666667]
similar item [2] with similarity [0.33333334]
similar item [3] with similarity [NaN]
similar item [4] with similarity [0.25]
PartialMultiplyMapper: looking at item [4]
looking at user [2] with preference [5.0]
similar item [1] with similarity [0.5]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.25]
similar item [4] with similarity [NaN]
looking at user [3] with preference [3.0]
similar item [1] with similarity [0.5]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.25]
similar item [4] with similarity [NaN]
looking at user [4] with preference [5.0]
similar item [1] with similarity [0.5]
similar item [2] with similarity [0.25]
similar item [3] with similarity [0.25]
similar item [4] with similarity [NaN]
AggregateAndRecommendReducer: Computing predictions for user [1]
Predicting preference towards [1]
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [1.3333334] to numerator and
similarity [0.6666667] to denominator
Predicted preference is [NaN]
Predicting preference towards [2]
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [0.6666667] to numerator and
similarity [0.33333334] to denominator
Predicted preference is [NaN]
Predicting preference towards [3]
adding preference*similarity [3.3333335] to numerator and
similarity [0.6666667] to denominator
adding preference*similarity [1.6666667] to numerator and
similarity [0.33333334] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
Predicted preference is [NaN]
Predicting preference towards [4]
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [2.5] to numerator and similarity
[0.5] to denominator
adding preference*similarity [0.5] to numerator and similarity
[0.25] to denominator
Predicted preference is [4.25]
AggregateAndRecommendReducer: Computing predictions for user [2]
Predicting preference towards [1]
adding preference*similarity [2.0] to numerator and similarity
[0.6666667] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [2.5] to numerator and similarity
[0.5] to denominator
Predicted preference is [NaN]
Predicting preference towards [2]
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [0.5] to numerator and similarity
[0.25] to denominator
adding preference*similarity [1.0] to numerator and similarity
[0.33333334] to denominator
Predicted preference is [3.3]
Predicting preference towards [3]
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [1.3333334] to numerator and
similarity [0.6666667] to denominator
Predicted preference is [NaN]
Predicting preference towards [4]
adding preference*similarity [1.0] to numerator and similarity
[0.5] to denominator
adding preference*similarity [0.75] to numerator and similarity
[0.25] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
Predicted preference is [NaN]
AggregateAndRecommendReducer: Computing predictions for user [3]
Predicting preference towards [1]
adding preference*similarity [1.5] to numerator and similarity
[0.5] to denominator
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
Predicted preference is [3.6666667]
Predicting preference towards [2]
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [0.75] to numerator and similarity
[0.25] to denominator
Predicted preference is [NaN]
Predicting preference towards [3]
adding preference*similarity [1.6666667] to numerator and
similarity [0.33333334] to denominator
adding preference*similarity [0.75] to numerator and similarity
[0.25] to denominator
Predicted preference is [4.142857]
Predicting preference towards [4]
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
Predicted preference is [NaN]
AggregateAndRecommendReducer: Computing predictions for user [4]
Predicting preference towards [1]
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
adding preference*similarity [2.5] to numerator and similarity
[0.5] to denominator
Predicted preference is [NaN]
Predicting preference towards [2]
adding preference*similarity [0.75] to numerator and similarity
[0.25] to denominator
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
Predicted preference is [4.0]
Predicting preference towards [3]
adding preference*similarity [1.25] to numerator and similarity
[0.25] to denominator
adding preference*similarity [2.0] to numerator and similarity
[0.6666667] to denominator
Predicted preference is [3.5454545]
Predicting preference towards [4]
adding preference*similarity [1.5] to numerator and similarity
[0.5] to denominator
adding preference*similarity [NaN] to numerator and similarity
[NaN] to denominator
Predicted preference is [NaN]
{noformat}
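For reference, the similarity values in the example above can be reproduced with a small stand-alone sketch of the Tanimoto coefficient over the sets of users rating each item (class and method names are made up; this is not the ItemSimilarityJob / MAHOUT-418 code):

```java
import java.util.Set;

public class TanimotoSketch {

    /** Tanimoto coefficient: |A intersect B| / |A union B|. */
    static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        long intersection = usersOfA.stream().filter(usersOfB::contains).count();
        long union = usersOfA.size() + usersOfB.size() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        Set<Long> burger = Set.of(1L, 2L, 4L);   // rated by dog, rabbit, donkey
        Set<Long> berries = Set.of(1L, 2L);      // rated by dog, rabbit
        Set<Long> icecream = Set.of(2L, 3L, 4L); // rated by rabbit, cow, donkey
        System.out.println(tanimoto(burger, berries));  // 2/3 ~ 0.66
        System.out.println(tanimoto(burger, icecream)); // 2/4 = 0.5
    }
}
```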
> Improving the distributed item-based recommender
> ------------------------------------------------
>
> Key: MAHOUT-420
> URL: https://issues.apache.org/jira/browse/MAHOUT-420
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Attachments: MAHOUT-420-2.patch, MAHOUT-420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple
> cooccurrence counts to using the standard computations of an item-based
> recommender as defined in Sarwar et al "Item-Based Collaborative Filtering
> Recommendation Algorithms"
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> What the distributed recommender generally does is that it computes the
> prediction values for all users towards all items those users have not rated
> yet. And the computation is done in the following way:
> u = a user
> i = an item not yet rated by u
> N = all items cooccurring with i
> Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
> The formula used in the paper which is used by
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar
> to the one above:
> u = a user
> i = an item not yet rated by u
> N = all items similar to i (where similarity is usually computed by
> pairwise comparison of the item-vectors of the user-item matrix)
> Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all
> n from N: abs(similarity(i,n)))
> There are only 2 differences:
> a) instead of the cooccurrence count, certain similarity measures like
> pearson or cosine can be used
> b) the resulting value is normalized by the sum of the similarities
> To overcome difference a) we would only need to replace the part that
> computes the cooccurrence matrix with the code from ItemSimilarityJob or the
> code introduced in MAHOUT-418, then we could compute arbitrary similarity
> matrices and use them in the same way the cooccurrence matrix is currently
> used. We just need to separate steps up to creating the co-occurrence matrix
> from the rest, which is simple since they're already different MR jobs.
> Regarding difference b) from a first look at the implementation I think it
> should be possible to transfer the necessary similarity matrix entries from
> the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to
> compute the normalization value in the denominator of the formula. This will
> take a little work, yes, but is still straightforward. It can be in the
> "common" part of the process, done after the similarity matrix is generated.
> I think work on this issue should wait until MAHOUT-418 is resolved as the
> implementation here depends on how the pairwise similarities will be computed
> in the future.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.