[https://issues.apache.org/jira/browse/FLINK-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629109#comment-15629109]
Gábor Hermann commented on FLINK-4713:
--------------------------------------
We have managed to rework the evaluation framework proposed by Theodore so that
ranking predictions fit in. Our approach is to use separate
{{RankingPredictor}} and {{Predictor}} traits. One main problem remains,
however: there is no common superclass for {{RankingPredictor}} and
{{Predictor}}, so the pipelining mechanism might not work. A {{Predictor}} can
only be at the end of the pipeline, so this should not really be a problem, but
I am not entirely sure. An alternative solution would be to have different
objects {{ALS}} and {{RankingALS}} that give different predictions, but both
extend only a {{Predictor}}. There could be implicit conversions between the
two. I would prefer the current solution if it does not break the pipelining.
[~tvas] What do you think about this?
(This seems to be a problem similar to having a {{predict_proba}} function in
scikit-learn classification models, where the same model, for the same input,
gives two different kinds of predictions: {{predict}} for discrete predictions
and {{predict_proba}} for probabilities.)
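To make the alternatives concrete, roughly (with simplified signatures; this is
neither FlinkML's actual {{Predictor}} API nor the exact code on the branch):
{code:scala}
// Rough sketch of the two design alternatives (simplified signatures).
import org.apache.flink.api.scala.DataSet

// Alternative A (current approach): two separate traits, no common superclass.
trait Predictor {
  // rating prediction: (user, item) -> (user, item, predicted rating)
  def predict(test: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)]
}

trait RankingPredictor {
  // top-k ranking prediction: users -> (user, item, rank)
  def predictRankings(k: Int, users: DataSet[Int]): DataSet[(Int, Int, Int)]
}

// A recommender such as ALS mixes in both, so one fitted model can feed
// either a pairwise or a ranking score.
class ALS extends Predictor with RankingPredictor {
  override def predict(test: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)] = ???
  override def predictRankings(k: Int, users: DataSet[Int]): DataSet[(Int, Int, Int)] = ???
}

// Alternative B: ALS and RankingALS as separate models that both extend only
// Predictor, with implicit conversions between them, e.g.
//   implicit def toRankingALS(als: ALS): RankingALS = ...
{code}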
On the other hand, we seem to have solved the scoring issue. Users can evaluate
a recommendation algorithm such as ALS with a score operating on rankings
(e.g. NDCG) or with a score operating on ratings (e.g. RMSE). They only need to
change the {{Score}} they use in their code, and nothing else.
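To make the difference between the two score families concrete, here is a toy,
plain-Scala illustration (not the framework code) of what each one consumes:
RMSE needs (trueRating, predictedRating) pairs, while nDCG@k needs a predicted
ranking together with the true relevances.
{code:scala}
// Toy illustration of the two score families (standard RMSE and nDCG@k formulas).
object ScoreToyExample {

  // RMSE over (trueRating, predictedRating) pairs - the pairwise case.
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (t, p) => (t - p) * (t - p) }.sum / pairs.size)

  // nDCG@k over the true relevances of the items predicted at rank 1, 2, ... - the ranking case.
  def ndcgAtK(relevanceByRank: Seq[Double], k: Int): Double = {
    def dcg(rels: Seq[Double]): Double =
      rels.take(k).zipWithIndex
        .map { case (rel, i) => rel / (math.log(i + 2) / math.log(2)) }
        .sum
    dcg(relevanceByRank) / dcg(relevanceByRank.sorted.reverse)
  }

  def main(args: Array[String]): Unit = {
    println(rmse(Seq((4.0, 3.5), (2.0, 2.5))))   // needs (true, predicted) rating pairs
    println(ndcgAtK(Seq(3.0, 0.0, 2.0), k = 3))  // needs the predicted ranking + true relevances
  }
}
{code}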
The main problem was that the {{evaluate}} method and
{{EvaluateDataSetOperation}} were not general enough. They reduce the
evaluation input to {{(trueValue, predictedValue)}} pairs (i.e. a
{{DataSet\[(PredictionType,PredictionType)\]}}), while ranking evaluation needs
a more general input: the true ratings ({{DataSet\[(Int,Int,Double)\]}}) and
the predicted rankings ({{DataSet\[(Int,Int,Int)\]}}).
Instead of {{EvaluateDataSetOperation}} we use a more general
{{PrepareOperation}}. We rename the {{Score}} of the original evaluation
framework to {{PairwiseScore}}; {{RankingScore}} and {{PairwiseScore}} share a
common trait {{AbstractScore}}. This way the user can apply either a
{{RankingScore}} or a {{PairwiseScore}} to a given model and only needs to
alter the score used in the code.
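A rough sketch of how the hierarchy could be typed (the signatures here are our
simplification, not necessarily what is on the branch):
{code:scala}
// Simplified sketch of the score hierarchy described above.
import org.apache.flink.api.scala.DataSet

// A score only sees the already-prepared evaluation input.
trait AbstractScore[PreparedInput] {
  def score(prepared: PreparedInput): DataSet[Double]
}

// The former Score: consumes (trueValue, predictedValue) pairs.
trait PairwiseScore[PredictionType]
  extends AbstractScore[DataSet[(PredictionType, PredictionType)]]

// Ranking scores consume the true ratings together with the predicted rankings.
trait RankingScore
  extends AbstractScore[(DataSet[(Int, Int, Double)], DataSet[(Int, Int, Int)])]
{code}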
In the case of pairwise scores (which only need true and predicted value pairs
for evaluation), {{EvaluateDataSetOperation}} is used as the
{{PrepareOperation}}. It prepares the evaluation by creating
{{(trueValue, predictedValue)}} pairs from the test dataset, so the result of
the preparation, and thus the input of {{PairwiseScore}}s, is a
{{DataSet\[(PredictionType,PredictionType)\]}}. In the case of rankings, the
{{PrepareOperation}} passes the test dataset through and creates the rankings,
so the result of the preparation, and thus the input of {{RankingScore}}s, is a
{{(DataSet\[(Int,Int,Double)\], DataSet\[(Int,Int,Int)\])}} pair. I believe
this is a fairly acceptable solution that avoids breaking the API.
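Typed roughly, the preparation step could look like this (again a simplified
assumption rather than the exact branch code):
{code:scala}
// Simplified sketch of the preparation step: it maps the test DataSet to
// whatever input the chosen score expects.
import org.apache.flink.api.scala.DataSet

trait PrepareOperation[Model, Testing, PreparedInput] {
  def prepare(model: Model, test: DataSet[Testing]): PreparedInput
}

// Pairwise case: EvaluateDataSetOperation plays this role and yields
// DataSet[(PredictionType, PredictionType)], i.e. (trueValue, predictedValue) pairs.

// Ranking case: the preparation passes the true ratings through and computes
// the predicted top-k rankings, yielding
// (DataSet[(Int, Int, Double)], DataSet[(Int, Int, Int)]).
{code}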
We have not gone further with the implementation, documentation, and code
cleanup, as we first need feedback regarding the API decisions. Are we on the
right path? What do you think about our solution? How acceptable is it?
The sketch code can be found on this branch:
[https://github.com/gaborhermann/flink/tree/ranking-rec-eval]
> Implementing ranking evaluation scores for recommender systems
> --------------------------------------------------------------
>
> Key: FLINK-4713
> URL: https://issues.apache.org/jira/browse/FLINK-4713
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Domokos Miklós Kelen
> Assignee: Gábor Hermann
>
> Follow up work to [4712|https://issues.apache.org/jira/browse/FLINK-4712]
> includes implementing ranking recommendation evaluation metrics (such as
> precision@k, recall@k, ndcg@k), [similar to Spark's
> implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems].
> It would be beneficial if we were able to design the API such that it could
> be included in the proposed evaluation framework (see
> [2157|https://issues.apache.org/jira/browse/FLINK-2157]).
> In its current form, this would mean generalizing the PredictionType type
> parameter of the Score class to allow for {{Array[Int]}} or {{Array[(Int,
> Double)]}}, and outputting the recommendations in the form {{DataSet[(Int,
> Array[Int])]}} or {{DataSet[(Int, Array[(Int,Double)])]}} meaning (user,
> array of items), possibly including the predicted scores as well.
> However, calculating for example nDCG for a given user u requires us to be
> able to access all of the (u, item, relevance) records in the test dataset,
> which means we would need to put this information in the second element of
> the {{DataSet[(PredictionType, PredictionType)]}} input of the scorer
> function as PredictionType={{Array[(Int, Double)]}}. This is problematic, as
> this Array could be arbitrarily long.
> Another option is to further rework the proposed evaluation framework to
> allow us to implement this properly, with inputs in the form of
> {{recommendations : DataSet[(Int,Int,Int)]}} (user, item, rank) and {{test :
> DataSet[(Int,Int,Double)]}} (user, item, relevance). This way, the scores
> could be implemented such that they can be calculated in a distributed way.
> The third option is to implement the scorer functions outside the evaluation
> framework.