Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/12574#issuecomment-212886112
Adding some further detail...
## Proposed semantics
* `ALSModel.transform` handles both point predictions (predict `rating` for
each `(user, item)` combo in the dataset), and "top-k" recommendations (predict
`Array(recommendations)` for each `user` in the dataset).
* Add 3 parameters to handle this:
* `k` - number of recommendations to compute for each unique `id` in
the input column.
* `recommendFor` - whether to recommend for `user` (recommends top `k`
items for each user) or `item` (recommends top `k` users for each item).
* `withScores` - (expert) whether to return recommendations as `[(id1,
score1), (id2, score2), ...]` or `[id1, id2, ...]`. Default `false` i.e. ids
only.
* Re-use `predictionCol` for recommendations. This means the output schema
/ semantics of `transform` differ when recommending top-k.
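To make the proposed parameters concrete, here's a minimal sketch of how they might be used. The setter names mirror the params above but are assumptions, not a final API, and the exact output schemas are my reading of the proposal:

```scala
val model = new ALS().fit(training)

// Point predictions (existing behaviour): input has `user` and `item`
// columns, and `prediction` is a Double per row.
model.transform(df)       // [user, item, ..., prediction: double]

// Top-k recommendations: 10 items per user, ids only (withScores = false).
model.setRecommendFor("user").setK(10)
model.transform(users)    // [user, ..., prediction: array<int>]

// With scores (expert): each recommendation carries its score.
model.setWithScores(true)
model.transform(users)    // prediction: array<struct<id, score>>
```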
### Interaction with evaluation
Here I've gone with the approach of having `ALSModel.transform` handle
munging the input data into the form required for evaluation using
`RankingEvaluator` (see #12461). So users can currently do the following (using
`RankingMetrics` until `RankingEvaluator` is merged):
```scala
val als = new ALS().setRecommendFor("user").setK(10)
val results = als.fit(training).transform(test)
  .select("prediction", "label")
  .as[(Array[Int], Array[Int])]
  .rdd
val metrics = new RankingMetrics(results)
println(metrics.meanAveragePrecision)
println(metrics.ndcgAt(als.getK))
...
```
Specifically:
* if the input dataset is the same format as for `ALS.fit`, i.e. it has a
`user` and `item` column, then `ALSModel.transform` returns **only** a set of
recommended items **and** a set of "ground truth" items for each user. That is,
it does a `distinct` on `user` and only returns columns `user`, `prediction`,
`label` for `(id, recommendations, ground truth)` respectively. So it discards
any other columns. The reasoning here is that in practice this form will only
really be used for cross-validation / evaluation (where you only care about the
`user, recommended, actual` output of `transform`).
* If the input dataset only has a `user` column, then `ALSModel.transform`
returns a set of recommended items for each user, along with any other columns
in the input dataset. The reasoning here is, once you have your `ALSModel` and
you want to make predictions for, say, a bunch of users, you will only pass in
a DataFrame with a `user` column (no `item` column, and maybe a bunch of
columns of user metadata, which in this use case you don't want to discard).
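Concretely, the two cases above might look like this (the schemas are my reading of the proposed semantics, not a final API):

```scala
// Case 1: evaluation-style input, same shape as for ALS.fit:
// test: [user, item, rating]
// transform does a distinct on `user` and keeps only three columns.
val eval = model.transform(test)
// eval schema: [user, prediction: array<int>, label: array<int>]

// Case 2: scoring-style input, `user` column plus metadata, no `item` column:
// users: [user, age, country]
val recs = model.transform(users)
// recs schema: [user, age, country, prediction: array<int>]
// metadata columns are carried through unchanged.
```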
## Possible alternative semantics
In the current approach, `ALSModel` handles transforming the input data
into a form suitable for `RankingEvaluator`. The alternative is to instead have
`RankingEvaluator` do that.
In this case:
* input dataset to `ALSModel.transform` for recommendations is kept the
same as for `ALS.fit`, i.e. `[user, item, rating]`.
* output dataset to `ALSModel.transform` for recommendations looks like
`[user, item, rating, prediction]` - where `rating` and `prediction` are real
numbers (not arrays as in the above approach). However, both the `rating` and
`prediction` columns can be `null`. A null `prediction` occurs when there is a
`user, item, rating` combo in the dataset, but the `item` does not occur in
the top-k recommendations for that `user`. A null `rating` occurs when an
`item` occurs in the top-k recommendations for a `user`, but that `user,
item, rating` combo doesn't occur in the dataset.
* `RankingEvaluator.evaluate` takes the format `[query, document,
relevance, prediction]`, where `relevance` and `prediction` are nullable. It
then basically groups by `query` and collects a set of *ground truth* documents
(from `document` where `relevance` is not null) and a set of *predicted*
documents (from `document` where `prediction` is not null). Additionally, since
**order matters** in predictions, `RankingEvaluator` would need to sort the
predicted set by `prediction` score.
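The null semantics and the grouping step could be sketched roughly as follows. This is a hedged illustration, not the actual `RankingEvaluator` implementation: the Spark SQL calls are just the obvious way to express the grouping (note `collect_list` skips nulls, and `sort_array` on structs orders by the first field, here the score):

```scala
import org.apache.spark.sql.functions._

// Illustrative input for one query with k = 2:
//  query | document | relevance | prediction
//    1   |    10    |    4.0    |    3.7      // relevant and predicted
//    1   |    11    |    5.0    |    null     // relevant, not in top-k
//    1   |    12    |    null   |    3.9      // predicted, not relevant

val grouped = df.groupBy("query").agg(
  // ground truth: documents whose relevance is non-null
  collect_list(when(col("relevance").isNotNull, col("document")))
    .as("labels"),
  // predicted: (score, document) pairs whose prediction is non-null,
  // kept as structs so they can be sorted by score afterwards
  collect_list(when(col("prediction").isNotNull,
    struct(col("prediction"), col("document")))).as("scored")
)

// order matters for ranking metrics, so sort predictions by score
// descending before projecting out the document ids
val ready = grouped
  .withColumn("sorted", sort_array(col("scored"), asc = false))
  .withColumn("predicted", col("sorted.document"))
  .select("query", "labels", "predicted")
```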
This alternative seems more "DataFrame"-like, but it would require changes
to `RankingEvaluator`. I have a version of `RankingEvaluator` that works in
this manner pretty much ready to go, and I've done most of the work on this
alternative of `ALSModel.transform` too, so this PR can be adjusted in that
direction quickly.
The downside to this alternative is that performance may suffer:
`ALSModel.transform` blocks the user and item factors and predicts using BLAS,
but would then need to explode/flatMap the results back into the required
`[user, item, rating, prediction]` format. `RankingEvaluator.evaluate` then
needs to group by `user` and collect the two sets of documents, as well as
re-sort the predicted set (duplicating the sort already done in
`ALSModel.transform`).