Repository: incubator-hivemall Updated Branches: refs/heads/master 02885f89f -> aafd0f1e3
Close #86: [HIVEMALL-113] Add MovieLens item-based CF tutorial Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/aafd0f1e Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/aafd0f1e Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/aafd0f1e Branch: refs/heads/master Commit: aafd0f1e35638e71271b0bd392423d3673d9b454 Parents: 02885f8 Author: Takuya Kitazawa <[email protected]> Authored: Thu Jun 15 17:56:03 2017 +0900 Committer: Makoto Yui <[email protected]> Committed: Thu Jun 15 17:56:03 2017 +0900 ---------------------------------------------------------------------- docs/gitbook/SUMMARY.md | 1 + docs/gitbook/recommend/movielens_cf.md | 256 ++++++++++++++++++++++++++++ 2 files changed, 257 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/aafd0f1e/docs/gitbook/SUMMARY.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index a4dccda..638b77b 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -146,6 +146,7 @@ * [MovieLens movie recommendation Tutorial](recommend/movielens.md) * [Data preparation](recommend/movielens_dataset.md) + * [Item-based Collaborative Filtering](recommend/movielens_cf.md) * [Matrix Factorization](recommend/movielens_mf.md) * [Factorization Machine](recommend/movielens_fm.md) * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md) http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/aafd0f1e/docs/gitbook/recommend/movielens_cf.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/movielens_cf.md b/docs/gitbook/recommend/movielens_cf.md new file mode 100644 index 0000000..e0ed545 --- /dev/null +++ b/docs/gitbook/recommend/movielens_cf.md @@ -0,0 +1,256 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +[Our user guide for item-based collaborative filtering (CF)](item_based_cf.md) introduced how to make recommendation based on item-item similarities. Here, we particularly focus on [DIMSUM](item_based_cf.html#dimsum-approximated-all-pairs-cosine-similarity-computation), an efficient and approximated similarity computation scheme, and try to make recommendation from the MovieLens data. + +<!-- toc --> + +# Compute movie-movie similarity + +[As we explained in the general introduction of item-based CF](item_based_cf.html#dimsum-approximated-all-pairs-cosine-similarity-computation.md), following query finds top-$$k$$ nearest-neighborhood movies for each movie: + +```sql +drop table if exists dimsum_movie_similarity; +create table dimsum_movie_similarity +as +with movie_magnitude as ( -- compute magnitude of each movie vector + select + to_map(j, mag) as mags + from ( + select + movieid as j, + l2_norm(rating) as mag + from + training + group by + movieid + ) t0 +), +movie_features as ( + select + userid as i, + collect_list( + feature(movieid, rating) + ) as feature_vector + from + training + group by + userid +), +partial_result as ( -- launch DIMSUM in a MapReduce fashion + select + dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.1 -disable_symmetric_output') + as (movieid, other, s) + from + movie_features f + left outer join movie_magnitude m +), +similarity as ( -- reduce (i.e., sum up) mappers' partial results + select + movieid, + other, + sum(s) as similarity + from + partial_result + group by + movieid, other +), +topk as ( + select + each_top_k( -- get top-10 nearest neighbors based on similarity score + 10, movieid, similarity, + movieid, other -- output items + ) as (rank, similarity, movieid, other) + from ( + select * from similarity + CLUSTER BY movieid + ) t +) +select + movieid, other, similarity +from + topk +; +``` + +| movieid | other | similarity | +|:---:|:---:|:---| +|1 | 2095 | 0.9377422722094696 | +|1 | 231 | 0.9316530366756418 | +|1 | 1407 | 0.9194745656079863 | +|1 | 3442 | 0.9133853300741587 | +|1 | 1792 | 0.9072960945403309 | +|...|...|...| + +Since we set `k=10`, output has 10 most-similar movies per `movieid`. + +> #### Note +> Since we specified an option `-disable_symmetric_output`, output table does not contain inverted similarities such as `<2095, 1>`, `<231, 1>`, `<1407, 1>`, ... + +# Prediction + +Next, we predict rating for unforeseen user-movie pairs based on the top-$$k$$ similarities: + +```sql +drop table if exists dimsum_prediction; +create table dimsum_prediction +as +with similarity_all as ( + -- copy (i1, i2)'s similarity as (i2, i1)'s one + select movieid, other, similarity from dimsum_movie_similarity + union all + select other as movieid, movieid as other, similarity from dimsum_movie_similarity +) +select + -- target user + t1.userid, + + -- recommendation candidate + t2.movieid, + + -- predicted rating: r_{u,i} = sum(s_{i,:} * r_{u,:}) / sum(s_{i,:}) + sum(t1.rating * t2.similarity) / sum(t2.similarity) as rating +from + training t1 -- r_{u,<movieid>} +left join -- s_{i,<other>} + similarity_all t2 + on t1.movieid = t2.other +where + -- do not include movies that user already rated + NOT EXISTS ( + SELECT a.movieid FROM training a + WHERE a.userid = t1.userid AND a.movieid = t2.movieid + ) +group by + t1.userid, t2.movieid +; +``` + +This query computes estimated rating as follows: + +|userid|movieid|rating| +|:---:|:---:|:---| +|1 | 1000 | 5.0 | +|1 | 1010 | 5.0 | +|1 | 1012 | 4.246349332667371 | +|1 | 1013 | 5.0 | +|1 | 1014 | 5.0 | +| ... | ... | ... | + +Theoretically, for the $$t$$-th nearest-neighborhood item $$\tau(t)$$, prediction can be done by top-$$k$$ weighted sum of target user's historical ratings: +$$ +r_{u,i} = \frac{\sum^k_{t=1} s_{i,\tau(t)} \cdot r_{u,\tau(t)} }{ \sum^k_{t=1} s_{i,\tau(t)} }, +$$ +where $$r_{u,i}$$ is user $$u$$'s rating for item (movie) $$i$$, and $$s_{i,j}$$ is similarity of $$i$$-$$j$$ (`movieid`-`other`) pair. + +> #### Caution +> Since the number of similarities and users' past ratings are limited, we cannot say this output **always** contains prediction for **every** unforeseen user-item pairs; sometimes prediction for a specific user-item pair might be missing (i.e., `NULL`). + +In fact, our goal is to make recommendation, but we can evaluate the intermediate result as a rating prediction problem: + +```sql +select + mae(t1.rating, t2.rating) as mae, + rmse(t1.rating, t2.rating) as rmse +from + testing t1 +left join + dimsum_prediction t2 + on t1.movieid = t2.movieid +where + t1.userid = t2.userid +; +``` + +| mae | rmse | +|:---:|:---:| +|0.7308365821230256 | 0.9594799959251938 | + +Rating of the MovieLens data is in `[1, 5]` range, so this average errors are reasonable as a predictor. + +# Recommendation + +By using the prediction table, making recommendation for each user is straightforward: + +```sql +drop table if exists dimsum_recommendation; +create table dimsum_recommendation +as +select + userid, + map_values(to_ordered_map(rating, movieid, true)) as rec_movies +from + dimsum_prediction +group by + userid +; +``` + +|userid | rec_movies | +|:---:|:---| +|1 | ["2590","999","372","1380","2078",...] +|2 | ["580","945","43","36","1704",...] +|3 | ["3744","852","1610","3740","2915",...] +|4 | ["3379","923","1997","2194","2944",...] +|5 | ["998","101","2696","2968","2275",...] | +| ... | ... | + +> #### Note +> Size of `rec_movies` varies depending on each user's `training` samples and what movies he/she already rated. + +# Evaluation + +Eventually, you can measure the quality of recommendation by using [ranking measures](../eval/rank.md): + +```sql +with truth as ( + select + userid, + map_values(to_ordered_map(rating, cast(movieid as string), true)) as truth + from + testing + group by + userid +) +select + recall(t1.rec_movies, t2.truth, 10) as recall, + precision(t1.rec_movies, t2.truth, 10) as precision, + average_precision(t1.rec_movies, t2.truth) as average_precision, + auc(t1.rec_movies, t2.truth) as auc, + mrr(t1.rec_movies, t2.truth) as mrr, + ndcg(t1.rec_movies, t2.truth) as ndcg +from + dimsum_recommendation t1 +join + truth t2 on t1.userid = t2.userid +where -- at least 10 recommended items are necessary to compute recall@10 and precision@10 + size(t1.rec_movies) >= 10 +; +``` + +| measure | accuracy | +|:---:|:---| +|**Recall@10**| 0.027033598585322713 | +|**Precision@10**| 0.009001989389920506 | +|**Average Precision** | 0.017363681149831108 | +|**AUC**| 0.5264553136097863 | +|**MRR**| 0.03507380742291146 | +|**NDCG**| 0.15787655209987522 | + +If you set larger value to the DIMSUM's `-threshold` option, similarity will be more aggressively approximated. Consequently, while efficiency is improved, the accuracy is likely to be decreased. \ No newline at end of file
