[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237067#comment-15237067
 ] 

Nick Pentreath commented on SPARK-13857:
----------------------------------------

My main point is that for cross-validation the "problem" is that the evaluator 
needs the input dataset to contain the ground truth "actual" column for each 
unique user (not for each row in the original input DF). The format of the 
input DF for {{fit}} is not compatible with that of (the proposed) 
{{RankingEvaluator.evaluate}}, and {{TrainValidationSplit.fit}} takes only one 
DF (which is passed to both {{Estimator.fit}} and {{Evaluator.evaluate}}), e.g.

{code}
// input DF for ALS.fit
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
// input DF for RankingEvaluator.evaluate
+----+----------+---------+
|user|   topk   |  actual |
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
{code}
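
For illustration, here is a rough sketch of how the second shape (one row per 
user with an "actual" array) could be derived from the first. It assumes the 
Spark 2.x {{SparkSession}} entry point; the object and helper names 
({{EvalInputSketch}}, {{actualPerUser}}, {{distinctUsers}}) are hypothetical 
and not part of any proposed API. Note that {{collect_list}} does not 
guarantee the order of the collected items.
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_list

// Hypothetical helpers, for illustration only.
object EvalInputSketch {
  // Ground truth per user: one row per user, items collected into an array column.
  def actualPerUser(ratings: DataFrame): DataFrame =
    ratings.groupBy("user").agg(collect_list("item").as("actual"))

  // Unique users for which top-k recommendations would be generated.
  def distinctUsers(ratings: DataFrame): DataFrame =
    ratings.select("user").distinct()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("eval-input-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same (user, item, rating) rows as the ALS.fit input above.
    val input = Seq(
      (1, 10, 1.0), (1, 20, 3.0), (1, 30, 5.0),
      (2, 20, 5.0), (2, 40, 3.0),
      (3, 10, 5.0), (3, 30, 4.0), (3, 40, 1.0)
    ).toDF("user", "item", "rating")

    actualPerUser(input).show()
    distinctUsers(input).show()
    spark.stop()
  }
}
{code}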

My point in #2 above was that we could have ALS handle it:
{code}
val input: DataFrame = ...
+----+----+------+
|user|item|rating|
+----+----+------+
|   1|  10|   1.0|
|   1|  20|   3.0|
|   1|  30|   5.0|
|   2|  20|   5.0|
|   2|  40|   3.0|
|   3|  10|   5.0|
|   3|  30|   4.0|
|   3|  40|   1.0|
+----+----+------+
val model = als.fit(input)
val predictions = 
model.setK(2).setUserTopKCol("user").setWithActual(true).transform(input)
+----+----------+---------+
|user|   topk   |  actual |
+----+----------+---------+
|   1|  [10, 20]| [40, 20]|
|   2|  [30, 40]| [30, 10]|
|   3|  [20, 30]| [20, 30]|
+----+----------+---------+
evaluator.setLabelCol("actual").setPredictionCol("topk").evaluate(predictions)
{code}

.. but this requires the input DF to {{transform}} to be the same as for 
{{fit}}, and requires some processing of that DF, which adds overhead (e.g. 
grouping by user to get the ground truth items for each user id, and 
{{input.select("user").distinct}}). That overhead is unavoidable for 
evaluation, though, since one does need to compute the ground truth and the 
unique user set for making recommendations. It is not "natural" for the case 
when you just want to make recommendations (e.g. using the best model from 
evaluation), since you'd normally just pass in a DF of users and get top-k back:
{code}
val input: DataFrame = ...
+----+
|user|
+----+
|   1|
|   3|
|   2|
+----+
model.setK(2).setUserTopKCol("user").transform(input).show
+----+----------+
|user|topk      |
+----+----------+
|   1|  [10, 20]|
|   2|  [30, 40]|
|   3|  [20, 30]|
+----+----------+
{code}

So overall it just feels a little clunky. It will likely be somewhat tricky 
for users to find the correct param settings to get it to work, but perhaps 
it's the best approach, combined with a good example in the docs. Also 
[~josephkb] was concerned about the prediction column having a different type 
depending on params - but I'd propose we have a separate column for top-k and 
set the column in the evaluator accordingly (as in the example above).
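
To make the evaluator side concrete, here is a minimal plain-Scala sketch of 
one ranking metric, precision at k, computed from the "topk" and "actual" 
columns of the example above. The choice of metric and all names 
({{PrecisionAtKSketch}}, {{precisionAtK}}, {{meanPrecisionAtK}}) are 
assumptions for illustration; this is not the proposed {{RankingEvaluator}} 
implementation.
{code}
// Hypothetical sketch: precision at k over (user, topk, actual) rows.
object PrecisionAtKSketch {
  // Fraction of the first k recommended items that appear in the ground truth.
  def precisionAtK(topk: Seq[Int], actual: Set[Int], k: Int): Double = {
    require(k > 0, "k must be positive")
    topk.take(k).count(actual.contains).toDouble / k
  }

  // Unweighted mean of per-user precision at k.
  def meanPrecisionAtK(rows: Seq[(Int, Seq[Int], Seq[Int])], k: Int): Double = {
    require(rows.nonEmpty, "need at least one user")
    rows.map { case (_, topk, actual) => precisionAtK(topk, actual.toSet, k) }.sum / rows.size
  }

  def main(args: Array[String]): Unit = {
    // Rows copied from the example predictions table: (user, topk, actual).
    val rows = Seq(
      (1, Seq(10, 20), Seq(40, 20)),
      (2, Seq(30, 40), Seq(30, 10)),
      (3, Seq(20, 30), Seq(20, 30))
    )
    println(meanPrecisionAtK(rows, k = 2))
  }
}
{code}
With the example rows, users 1 and 2 each have one hit out of two and user 3 
has two out of two, so the mean works out to 2/3.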

> Feature parity for ALS ML with MLLIB
> ------------------------------------
>
>                 Key: SPARK-13857
>                 URL: https://issues.apache.org/jira/browse/SPARK-13857
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Nick Pentreath
>            Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).


