(cc'ing dev list also)
I think a more general version of ranking metrics that allows arbitrary
relevance scores could be useful. Ranking metrics apply to other settings
such as search and other learning-to-rank use cases, so the implementation
should be somewhat more generic than pure recommender settings.
The one issue with the proposed implementation is that it is not compatible
with the existing cross-validators within a pipeline.
As I've mentioned on the linked JIRAs & PRs, one option is to create a
special set of cross-validators for recommenders, that address the issues
of (a) dataset splitting specific to recommender settings (user-based
stratified sampling, time-based etc) and (b) ranking-based evaluation.
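For option (a), a minimal plain-Python sketch of a user-based stratified holdout split; the function and parameter names here are illustrative only, not Spark API:

```python
import random

def per_user_holdout(interactions, holdout_frac=0.2, seed=42):
    """Hold out a fraction of each user's items for evaluation, so every
    user keeps some training history; the held-out items form that user's
    ground-truth set for ranking metrics."""
    rng = random.Random(seed)
    by_user = {}
    for user, item in interactions:
        by_user.setdefault(user, []).append(item)
    train, test = [], []
    for user, items in by_user.items():
        rng.shuffle(items)
        # Users with a single interaction go entirely to training.
        n_test = max(1, int(len(items) * holdout_frac)) if len(items) > 1 else 0
        test += [(user, it) for it in items[:n_test]]
        train += [(user, it) for it in items[n_test:]]
    return train, test
```

A time-based variant would sort each user's items by timestamp instead of shuffling and hold out the most recent ones.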
The other option is to have the ALSModel itself capable of generating the
"ground-truth" set within the same dataframe output from "transform" (ie
predict top k) that can be fed into the cross-validator (with
RankingEvaluator) directly. That's the approach I took so far in
Both options are valid and have their pros and cons - open to
comments / suggestions.
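For concreteness, the graded NDCG discussed further down the thread (gain = 2^relevance - 1, versus the binary gain of 1.0 that RankingMetrics effectively assumes) can be sketched in plain Python; this illustrates the formula only and is not Spark code:

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k positions,
    using the graded gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, relevance, k):
    """ranked_items: predicted ranking, best first.
    relevance: dict mapping item -> graded relevance score."""
    gains = [relevance.get(item, 0.0) for item in ranked_items]
    # Ideal DCG: relevance scores sorted from highest to lowest.
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With all relevance values equal to 1.0, the gain 2^1 - 1 reduces to 1.0, which recovers the binary behavior; a "relevance column" in a DataFrame-based evaluator could default that way.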
On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim <jongw...@nyu.edu> wrote:
> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
> I understand that the label is not a relevance score, but I have been, and
> I think many people are, using the label as relevance in the implicit
> feedback context, where an exact user-provided label is not defined anyway.
> I think that's why RiVal <https://github.com/recommenders/rival> uses the
> term "preference" for both the label for MAE and the relevance for NDCG.
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to RankingMetrics's constructor. I think it
> would be nice if the upcoming DataFrame-based RankingEvaluator could
> optionally accept a "relevance column" with non-binary relevance values,
> otherwise defaulting to either 1.0 or the label column.
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are the same as RiVal's.
> Jong Wook
> On 19 September 2016 at 03:13, Sean Owen <so...@cloudera.com> wrote:
>> Yes, relevance is always 1. The label is not a relevance score, so I
>> don't think it's valid to use it as such.
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jongw...@nyu.edu> wrote:
>> > Hi,
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > RiVal give different results; it seems that RiVal's is the one that
>> > matches the definition:
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> > To my knowledge, NDCG should depend on the relevance (or preference)
>> > values, but Spark's implementation seems not to; it uses 1.0 where it
>> > should be 2^(relevance) - 1, probably assuming that relevance is all
>> > 1.0? I also tried tweaking it, but its method of obtaining the ideal
>> > DCG also seems wrong.
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces numbers
>> > identical to Kaggle's and RiVal's results, and I'm wondering whether
>> > it would be appropriate to add it back to MLlib.
>> > Jong Wook