Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
I commented further on the
[JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855).
Sorry if my other comments here and on JIRA were unclear. The proposed
input schema for `RankingEvaluator` is:
### Schema 1
```
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
```
Note that the `rating` and `prediction` columns can be `null`. This
is by design. The three rows above show the three possible cases:
1. 1st row indicates a (user-item) pair that occurs in *both* the
ground-truth set *and* the top-k predictions;
2. 2nd row indicates a (user-item) pair that occurs in the ground-truth
set, *but not* in the top-k predictions;
3. 3rd row indicates a (user-item) pair that occurs in the top-k
predictions, *but not* in the ground-truth set.
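A minimal sketch of how the three cases arise, assuming Schema 1 is produced by a full outer join of the ground-truth set and the top-k predictions on the (user, item) key. The `EvalRow` type and the `join` helper are hypothetical and not part of any Spark API; plain Scala collections stand in for DataFrames, and `None` stands in for `null`.

```scala
// Hypothetical row type mirroring Schema 1; None stands in for null.
final case class EvalRow(userId: Int, movieId: Int,
                         rating: Option[Double], prediction: Option[Double])

object Schema1Sketch {
  // Full outer join of ground truth and top-k predictions, keyed by (user, item).
  def join(truth: Map[(Int, Int), Double],
           topK: Map[(Int, Int), Double]): Seq[EvalRow] =
    (truth.keySet ++ topK.keySet).toSeq.sorted.map { case key @ (user, item) =>
      EvalRow(user, item, truth.get(key), topK.get(key))
    }

  // Values taken from the Schema 1 table above.
  val truth = Map((230, 318) -> 5.0, (230, 3424) -> 4.0)
  val topK  = Map((230, 318) -> 4.2403245, (230, 81191) -> 4.317455)
  val rows  = join(truth, topK)
}
```

A key present on only one side of the join yields a row with `None` in the other column, which corresponds to cases 2 and 3.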
_Note:_ for reference, the input to the current `mllib` `RankingMetrics` is:
### Schema 2
```
RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
```
(So actually neither of the above schemas is easily compatible with the
return schema in this PR - but I think it is not really necessary to match
the `mllib.RankingMetrics` format)
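For what it's worth, Schema 1 appears to carry enough information to derive the Schema 2 arrays if that were ever needed. The sketch below assumes the predicted array should be ordered by descending `prediction`; the `RatedRow` type and `toLabelArrays` helper are hypothetical, and plain collections stand in for DataFrames and RDDs.

```scala
// Hypothetical row type for Schema 1; None stands in for null.
final case class RatedRow(userId: Int, movieId: Int,
                          rating: Option[Double], prediction: Option[Double])

object Schema2Sketch {
  // Per user: (ground-truth item ids, predicted item ids by descending score).
  def toLabelArrays(rows: Seq[RatedRow]): Map[Int, (Seq[Int], Seq[Int])] =
    rows.groupBy(_.userId).map { case (user, rs) =>
      // An item is in the ground truth whenever its rating is present.
      val truth = rs.filter(_.rating.isDefined).map(_.movieId)
      // An item is in the top-k whenever its prediction is present.
      val predicted = rs
        .collect { case RatedRow(_, item, _, Some(score)) => (item, score) }
        .sortBy { case (_, score) => -score }
        .map(_._1)
      (user, (truth, predicted))
    }

  // The three rows from the Schema 1 table.
  val rows = Seq(
    RatedRow(230, 318, Some(5.0), Some(4.2403245)),
    RatedRow(230, 3424, Some(4.0), None),
    RatedRow(230, 81191, None, Some(4.317455))
  )
  val byUser = toLabelArrays(rows)
}
```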
### ALS cross-validation
My proposal for fitting ALS into cross-validation is that
`ALSModel.transform` will output a DF of **Schema 1**, but *only* when the
parameters `k` and `recommendFor` are appropriately set and the input DF
contains both `user` and `item` columns. In practice, this scenario will occur
during cross-validation only.
So what I am saying is that ALS itself (not the evaluator) must know how to
return the correct DataFrame output from `transform` such that it can be used
in a cross-validation as input to the `RankingEvaluator`.
__Concretely:__
```scala
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
.setEvaluator(new RankingEvaluator().setK(10))
.setEstimator(als)
.setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
```
So while it is complex under the hood - to users it's simply a case of
setting 2 params and the rest is as normal.
Now, we have the best model selected by cross-validation. We can make
recommendations using these convenience methods (I think it will need a cast):
```scala
val recommendations =
  bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
```
Alternatively, the `transform` version looks like this:
```scala
val usersDF = ...
// +------+
// |userId|
// +------+
// |     1|
// |     2|
// |     3|
// +------+
val recommendations = bestModel.transform(usersDF)
```
So the questions:
1. should we support the above `transform`-based recommendations? Or only
support it for cross-validation purposes as a special case?
2. if we do, what should the output schema of the above `transform` version
look like? It must certainly match the output of `recommendX` methods.
The options are:
(1) The schema in this PR:
**Pros**: as you mention above - also more "compact"
**Cons**: doesn't match up so closely with the `transform`
"cross-validation" schema above
(2) The schema below. It is basically an "exploded" version of option (1)
```
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
```
**Pros**: matches more closely with the cross-validation / evaluator input
format. Perhaps slightly more "dataframe-like".
**Cons**: less compact; lose ordering?; may require more munging to save to
external data stores etc.
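To make the relationship between the two options concrete, here is a rough sketch in plain Scala collections. It assumes option (1) holds, per user, a sequence of (item, score) pairs with the best first; that shape, and the `explode` / `compact` helper names, are assumptions for illustration rather than the actual schema or API in this PR.

```scala
object ExplodeSketch {
  // Option (1), assumed shape: one entry per user with (item, score) pairs, best first.
  // Option (2): one row per (user, item, score).
  def explode(compact: Map[Int, Seq[(Int, Double)]]): Seq[(Int, Int, Double)] =
    compact.toSeq.sortBy(_._1).flatMap { case (user, recs) =>
      recs.map { case (item, score) => (user, item, score) }
    }

  // Inverse direction: regroup by user and sort by descending score,
  // since row order in the exploded form cannot be relied upon.
  def compact(rows: Seq[(Int, Int, Double)]): Map[Int, Seq[(Int, Double)]] =
    rows.groupBy(_._1).map { case (user, rs) =>
      (user, rs.map { case (_, item, score) => (item, score) }.sortBy(-_._2))
    }

  // Values taken from the option (2) table above.
  val option1 = Map(1 -> Seq((1, 4.3), (5, 3.2), (9, 2.1)))
  val option2 = explode(option1)
}
```

On the "lose ordering?" point: because the score is kept in every exploded row, the ranking can be recovered with a sort, at the cost of the extra regrouping step shown in `compact`.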
Anyway sorry for hijacking this PR discussion - but as I think you can see,
the evaluator / ALS transform interplay is a bit subtle and requires some
thought to get the right approach.