Looking at https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82
For each user in the test set, you generate an Array of the top K predicted item ids (probably Int or String) and an Array of ground-truth item ids (the items that user is known to have rated or liked in the test set). You build a RankingMetrics from those pairs and call precisionAt(k) to compute MAP@k. (Actually this method name is a bit misleading: it should be meanAveragePrecisionAt, where the other method there is without a cutoff at k. However, both compute MAP.)

The challenge at scale is actually computing all the top Ks for each user, as it requires broadcasting all the item factors (unless there is a smarter way?).

I wonder if it is possible to extend the DIMSUM idea to computing a top-K matrix multiply between the user and item factor matrices, as opposed to all-pairs similarity of one matrix?

On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Is there an example of how to use RankingMetrics?
>
> Let's take the user, document example... we get user x topic and
> document x topic matrices as the model...
>
> Now for each user, we can generate the topK documents by doing a sort on
> (1 x topic) dot (topic x document) and picking the topK...
>
> Is it possible to validate such a topK finding algorithm using
> RankingMetrics?
>
>
> On Wed, Oct 29, 2014 at 12:14 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
> > Let's narrow the context from matrix factorization to recommendation
> > via ALS. It adds extra complexity if we treat it as a multi-class
> > classification problem. ALS only outputs a single value for each
> > prediction, which is hard to convert to a probability distribution over
> > the 5 rating levels. Treating it as a binary classification problem or
> > a ranking problem does make sense. RankingMetrics is in master.
> > Feel free to add prec@k and ndcg@k to examples.MovieLensALS. ROC
> > should be good to add as well.
> >
> > -Xiangrui
> >
> >
> > On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das <debasish.da...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > In the current factorization flow, we cross-validate on the test
> > > dataset using the RMSE number, but there are some other measures
> > > which are worth looking into.
> > >
> > > If we consider the problem as a regression problem and the ratings
> > > 1-5 are considered as 5 classes, it is possible to generate a
> > > confusion matrix using MulticlassMetrics.scala
> > >
> > > If the ratings are only 0/1 (like from the Spotify demo from Spark
> > > Summit) then it is possible to use Binary Classification Metrics to
> > > come up with the ROC curve...
> > >
> > > For topK user/products we should also look into prec@k and ndcg@k as
> > > the metric...
> > >
> > > Does it make sense to add the multiclass metric and prec@k, ndcg@k in
> > > examples.MovieLensALS along with RMSE?
> > >
> > > Thanks.
> > > Deb
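
To make the flow discussed in this thread concrete, here is a minimal single-machine sketch in plain Python (not Spark): score every item for a user by the dot product of the factor vectors, keep the top K, and compare against that user's ground-truth items. All names and the toy factors are made up for illustration. The two metric functions follow the usual textbook definitions of precision@k and average precision; I have not checked them against what RankingMetrics returns, so treat the numbers as illustrative only.

```python
from typing import Dict, List, Sequence, Set, Tuple


def top_k_items(user_vec: Sequence[float],
                item_factors: Dict[str, Sequence[float]],
                k: int) -> List[str]:
    """Score every item by dot product with the user vector; return the k best ids."""
    scores = {
        item_id: sum(u * v for u, v in zip(user_vec, item_vec))
        for item_id, item_vec in item_factors.items()
    }
    # Descending score; ties broken by item id so the result is deterministic.
    ranked = sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
    return ranked[:k]


def precision_at_k(predicted: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k predictions that are in the ground-truth set."""
    hits = sum(1 for item in predicted[:k] if item in relevant)
    return hits / k


def average_precision(predicted: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate(user_factors: Dict[str, Sequence[float]],
             item_factors: Dict[str, Sequence[float]],
             ground_truth: Dict[str, Set[str]],
             k: int) -> Tuple[float, float]:
    """Return (mean precision@k, mean average precision over the top-k lists)."""
    p_at_k, ap = [], []
    for user_id, user_vec in user_factors.items():
        predicted = top_k_items(user_vec, item_factors, k)
        relevant = ground_truth.get(user_id, set())
        p_at_k.append(precision_at_k(predicted, relevant, k))
        ap.append(average_precision(predicted, relevant))
    return sum(p_at_k) / len(p_at_k), sum(ap) / len(ap)


# Toy rank-2 factors (hypothetical values, only for illustration).
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5], "d": [0.1, 0.2]}
ground_truth = {"u1": {"a", "d"}, "u2": {"b", "c"}}

mean_p_at_k, mean_ap = evaluate(user_factors, item_factors, ground_truth, k=2)
```

Note that top_k_items scans every item for every user, which is exactly the part that gets expensive at scale: in a distributed setting each worker would need the full item_factors table, hence the broadcast concern above.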