Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/12574#issuecomment-212886112
Adding some further detail...
## Proposed semantics
* `ALSModel.transform` handles both point predictions (predict `rating` for
each `(user, item)` combo in the dataset), and "top-k" recommendations (predict
`Array(recommendations)` for each `user` in the dataset).
* Add 3 parameters to handle this:
* `k` - number of recommendations to compute for each unique `id` in
the input column.
* `recommendFor` - whether to recommend for `user` (recommends top `k`
items for each user) or `item` (recommends top `k` users for each item).
* `withScores` - (expert) whether to return recommendations as `[(id1,
score1), (id2, score2), ...]` or `[id1, id2, ...]`. Default `false` i.e. ids
only.
* Re-use `predictionCol` for recommendations. This means the output schema
/ semantics of `transform` differ when recommending top-k.
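To make the proposed parameters concrete, here's a minimal sketch of how they might be used. The setter names mirror the params above but are assumptions, not a final API, and the exact output schemas are my reading of the proposal:

```scala
val model = new ALS().fit(training)

// Point predictions (existing behaviour): input has `user` and `item`
// columns, and `prediction` is a Double per row.
model.transform(df)       // [user, item, ..., prediction: double]

// Top-k recommendations: 10 items per user, ids only (withScores = false).
model.setRecommendFor("user").setK(10)
model.transform(users)    // [user, ..., prediction: array<int>]

// With scores (expert): each recommendation carries its score.
model.setWithScores(true)
model.transform(users)    // prediction: array<struct<id, score>>
```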
### Interaction with evaluation
Here I've gone with the approach of having `ALSModel.transform` handle
munging the input data into the form required for evaluation using
`RankingEvaluator` (see #12461). So users can currently do the following (using
`RankingMetrics` until `RankingEvaluator` is merged):
```scala
val als = new ALS().setRecommendFor("user").setK(10)
val results = als.fit(training).transform(test)
  .select("prediction", "label")
  .as[(Array[Int], Array[Int])]
  .rdd
val metrics = new RankingMetrics(results)
println(metrics.meanAveragePrecision)
println(metrics.ndcgAt(als.getK))
...
```
Specifically:
* if the input dataset is the same format as for `ALS.fit`, i.e. it has a
`user` and `item` column, then `ALSModel.transform` returns **only** a set of
recommended items **and** a set of "ground truth" items for each user. That is,
it does a `distinct` on `user` and only returns columns `user`, `prediction`,
`label` for `(id, recommendations, ground truth)` respectively. So it discards
any other columns. The reasoning here is that in practice this form will only
really be used for cross-validation / evaluation (where you only care about the
`user, recommended, actual` output of `transform`).
* If the input dataset only has a `user` column, then `ALSModel.transform`
returns a set of recommended items for each user, along with any other columns
in the input dataset. The reasoning here is, once you have your `ALSModel` and
you want to make predictions for, say, a bunch of users, you will only pass in
a DataFrame with a `user` column (no `item` column, and maybe a bunch of
columns of user metadata, which in this use case you don't want to discard).
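Concretely, the two cases above might look like this (the schemas are my reading of the proposed semantics, not a final API):

```scala
// Case 1: evaluation-style input, same shape as for ALS.fit:
// test: [user, item, rating]
// transform does a distinct on `user` and keeps only three columns.
val eval = model.transform(test)
// eval schema: [user, prediction: array<int>, label: array<int>]

// Case 2: scoring-style input, `user` column plus metadata, no `item` column:
// users: [user, age, country]
val recs = model.transform(users)
// recs schema: [user, age, country, prediction: array<int>]
// metadata columns are carried through unchanged.
```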
## Possible alternative semantics
In the current approach, `ALSModel` handles transforming the input data
into a form suitable for `RankingEvaluator`. The alternative is to instead have
`RankingEvaluator` do that.
In this case:
* input dataset to `ALSModel.transform` for recommendations is kept the
same as for `ALS.fit`, i.e. `[user, item, rating]`.
* output dataset to `ALSModel.transform` for recommendations looks like
`[user, item, rating, prediction]` - where `rating` and `prediction` are real
numbers (not arrays as in the above approach). However, both the `rating` and
`prediction` columns can be `null`. A null `prediction` occurs when there is a
`user, item, rating` combo in the dataset, but the `item` does not occur in
the top-k recommendations for that `user`. A null `rating` occurs when an
`item` occurs in the top-k recommendations for a `user`, but that `user,
item, rating` combo doesn't occur in the dataset.
* `RankingEvaluator.evaluate` takes the format `[query, document,
relevance, prediction]`, where `relevance` and `prediction` are nullable. It
then basically groups by `query` and collects a set of *ground truth* documents
(from `document` where `relevance` is not null) and a set of *predicted*
documents (from `document` where `prediction` is not null). Additionally, since
**order matters** in predictions, `RankingEvaluator` would need to sort the
predicted set by `prediction` score.
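The null semantics and the grouping step could be sketched roughly as follows. This is a hedged illustration, not the actual `RankingEvaluator` implementation: the Spark SQL calls are just the obvious way to express the grouping (note `collect_list` skips nulls, and `sort_array` on structs orders by the first field, here the score):

```scala
import org.apache.spark.sql.functions._

// Illustrative input for one query with k = 2:
//  query | document | relevance | prediction
//    1   |    10    |    4.0    |    3.7      // relevant and predicted
//    1   |    11    |    5.0    |    null     // relevant, not in top-k
//    1   |    12    |    null   |    3.9      // predicted, not relevant

val grouped = df.groupBy("query").agg(
  // ground truth: documents whose relevance is non-null
  collect_list(when(col("relevance").isNotNull, col("document")))
    .as("labels"),
  // predicted: (score, document) pairs whose prediction is non-null,
  // kept as structs so they can be sorted by score afterwards
  collect_list(when(col("prediction").isNotNull,
    struct(col("prediction"), col("document")))).as("scored")
)

// order matters for ranking metrics, so sort predictions by score
// descending before projecting out the document ids
val ready = grouped
  .withColumn("sorted", sort_array(col("scored"), asc = false))
  .withColumn("predicted", col("sorted.document"))
  .select("query", "labels", "predicted")
```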
This alternative seems more "DataFrame"-like, but it would require changes
to `RankingEvaluator`. I have a version of `RankingEvaluator` that works in
this manner pretty much ready to go, and I've done most of the work on this
alternative of `ALSModel.transform` too, so this PR can be adjusted in that
direction quickly.
The downside to this alternative is that performance may suffer:
`ALSModel.transform` blocks the user and item factors and predicts using BLAS,
but would then need to explode/flatMap the results back into the required
`[user, item, rating, prediction]` format. `RankingEvaluator.evaluate` then
needs to group by `user` and collect the two sets of documents, as well as
re-sort the predicted set (duplicating the sort already done in
`ALSModel.transform`).