[
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933
]
Nick Pentreath edited comment on SPARK-14409 at 3/6/17 9:06 AM:
----------------------------------------------------------------
I've thought about this a lot over the past few days, and I think the approach
should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].
*Goal*
Provide a DataFrame-based ranking evaluator that is general enough to handle
common scenarios such as recommendations (ALS), search ranking, and ad click
prediction, using ranking metrics (e.g. recent Kaggle competitions for
illustration: [Outbrain Ad Clicks using
MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia
Hotel Search Ranking using
NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).
*RankingEvaluator input format*
{{evaluate}} would take a {{DataFrame}} with columns:
* {{queryCol}} - the column containing "query id" (e.g. "query" for cases such
as search ranking; "user" for recommendations; "impression" for ad click
prediction/ranking, etc);
* {{documentCol}} - the column containing "document id" (e.g. "document" in
search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column
containing the true relevance score for a query-document pair (e.g. in
recommendations this would be the "rating"). This column will only be used for
filtering out "irrelevant" documents from the ground-truth set (see Param
{{goodThreshold}} mentioned
[above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a
query-document pair. The predicted ids will be ordered by this column for
computing ranking metrics (order matters for the predictions but generally not
for the ground truth, which is treated as a set).
The reasoning is that this format is flexible & generic enough to encompass the
diverse use cases mentioned above.
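To make the proposed surface a little more concrete, here is a rough sketch of what the evaluator's params might look like; the class name, param set and defaults are illustrative only, not a final API:
{code}
// Hypothetical sketch only: names and defaults are illustrative, not a final API.
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.{DoubleParam, IntParam, Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

class RankingEvaluator(override val uid: String) extends Evaluator {
  def this() = this(Identifiable.randomUID("rankingEval"))

  val queryCol = new Param[String](this, "queryCol", "query id column (user, query, impression, ...)")
  val documentCol = new Param[String](this, "documentCol", "document id column (item, document, ad, ...)")
  val labelCol = new Param[String](this, "labelCol", "true relevance column, used only to filter the ground-truth set")
  val predictionCol = new Param[String](this, "predictionCol", "predicted relevance column, used to order predictions")
  val goodThreshold = new DoubleParam(this, "goodThreshold", "minimum relevance for a document to count as ground truth")
  val k = new IntParam(this, "k", "cut-off for top-k metrics")
  val metricName = new Param[String](this, "metricName", "metric to compute, e.g. map, ndcg@k, precision@k")

  override def evaluate(dataset: Dataset[_]): Double = ???  // see the window/UDF sketches below
  override def copy(extra: ParamMap): RankingEvaluator = defaultCopy(extra)
  override def isLargerBetter: Boolean = true
}
{code}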
Here is an illustrative example from recommendations as a special case:
{code}
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
| 230| 318| 5.0| 4.2403245|
| 230| 3424| 4.0| null|
| 230| 81191| null| 4.317455|
+------+-------+------+----------+
{code}
You will notice that the {{rating}} and {{prediction}} columns can be {{null}}.
This is by design. There are three cases shown above:
# 1st row indicates a query-document (user-item) pair that occurs in *both* the
ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but
*not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth
set, but *does* occur in the top-k predictions.
*Note* that while technically the input allows both of these columns to be
{{null}} in the same row, in practice that won't occur, since a query-document
pair must appear in at least one of the ground-truth set or the predictions. If
such a row does occur for some reason, it can simply be ignored.
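For concreteness, here is a rough sketch of how such an input could be assembled for the recommendation case. The {{ratings}} and {{topKPredictions}} DataFrames and their column names are assumed for illustration (the latter being e.g. the output of the recommend-all methods proposed in SPARK-13857):
{code}
import org.apache.spark.sql.DataFrame

// Assumed inputs (illustrative): ground-truth ratings and per-user top-k predicted items.
val groundTruth: DataFrame = ratings.select("userId", "movieId", "rating")
val topK: DataFrame = topKPredictions.select("userId", "movieId", "prediction")

// A full outer join yields exactly the three row types shown above: pairs in both sets,
// pairs only in the ground truth (null prediction), and pairs only in the top-k
// predictions (null rating).
val evalInput = groundTruth.join(topK, Seq("userId", "movieId"), "full_outer")
{code}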
*Evaluator approach*
The evaluator will apply a window function partitioned by {{queryCol}} and
ordered by {{predictionCol}} within each query. Then, {{collect_list}} can be used to
arrive at the following intermediate format:
{code}
+------+--------------------+--------------------+
|userId| true_labels| predicted_labels|
+------+--------------------+--------------------+
| 230|[318, 3424, 7139,...|[81191, 93040, 31...|
+------+--------------------+--------------------+
{code}
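A rough sketch of that aggregation, reusing the recommendation column names and the {{evalInput}} DataFrame from the join sketch above (the {{k}} and {{goodThreshold}} values are illustrative; the real evaluator would take them as params):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val k = 10                 // illustrative top-k cut-off
val goodThreshold = 3.0    // illustrative relevance threshold

// Rank predicted documents within each query by descending predicted relevance.
val byQuery = Window.partitionBy("userId").orderBy(col("prediction").desc)

val intermediate = evalInput
  .withColumn("rank", row_number().over(byQuery))
  .groupBy("userId")
  .agg(
    // Ground-truth set: documents whose true relevance passes the threshold; order is
    // ignored, and collect_list drops the nulls produced by non-matching rows.
    collect_list(when(col("rating") >= goodThreshold, col("movieId"))).as("true_labels"),
    // Predicted list: top-k documents tagged with their rank, sorted by rank, ids extracted.
    sort_array(collect_list(when(col("prediction").isNotNull && col("rank") <= k,
      struct(col("rank"), col("movieId"))))).getField("movieId").as("predicted_labels")
  )
{code}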
*Relationship to RankingMetrics*
Technically the intermediate format above is the same as that used by
{{RankingMetrics}}, so perhaps we could simply wrap the {{mllib}} version.
*Note* however that the {{mllib}} class is parameterized by the type of
"document": {code}RankingMetrics[T]{code}
I believe for the generic case we must support both {{NumericType}} and
{{StringType}} for id columns (rather than restricting to {{Int}} as in Danilo's
& Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to rewrite the ranking metrics computations as UDFs, along the lines of:
{code}udf { (predicted: Seq[Any], actual: Seq[Any]) => ... }{code}
I strongly prefer option #2 as it is more flexible and in keeping with the
DataFrame style of Spark ML components (as a side note, this will give us a
chance to review the implementations & naming of metrics, since there are some
issues with a few of the metrics).
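To illustrate option #2, here is a minimal sketch of one metric (precision@k) written against the intermediate format above; the exact metric definitions would be settled as part of the review just mentioned, and {{k}} is the same illustrative cut-off as in the earlier sketch:
{code}
import org.apache.spark.sql.functions.{avg, col, udf}

// Sketch of a type-agnostic precision@k: ids are compared as opaque values, so this
// works for both numeric and string id columns without parameterizing the evaluator.
val precisionAtK = udf { (predicted: Seq[Any], actual: Seq[Any]) =>
  val actualSet = actual.toSet
  if (actualSet.isEmpty) 0.0
  else predicted.take(k).count(actualSet.contains).toDouble / k
}

// Per-query metric values, averaged across queries for the final evaluate() result.
val result = intermediate
  .select(avg(precisionAtK(col("predicted_labels"), col("true_labels"))).as("precisionAtK"))
{code}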
That is my proposal (sorry Yong, this is quite different now from the work
you've done in your PR). If Yong or Danilo has time to update his PR in this
direction, let me know.
cc [~josephkb] FYI
Thanks!
> Investigate adding a RankingEvaluator to ML
> -------------------------------------------
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Nick Pentreath
> Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful
> for recommendation evaluation (and potentially in other settings).
> Should be thought about in conjunction with adding the "recommendAll" methods
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.