Hey all,
We've been working on improvements to the recommendation algorithms in
Flink ML, and some API design questions have come up. In short, our
plans are:
- Extend ALS to work on implicit feedback datasets [1]
- DSGD implementation for matrix factorization [2]
- Ranking prediction based on a matrix factorization model [3]
- Evaluations for recommenders (precision, recall, nDCG) [4]
First, we've seen that an evaluation framework has been implemented (in
a not-yet-merged PR [5]), but evaluations of recommenders would not fit
into this framework. This is basically because recommender evaluations,
instead of comparing real numbers or fixed-size vectors, compare
top-item lists of possibly different, arbitrarily large sizes (see the
sketch below). The details are described in FLINK-4713 [4]. I see three
possible solutions:
- rework the evaluation framework proposed in [5] to allow inputs
suitable for recommender evaluations,
- force the recommender evaluations into the framework in an unnatural
form, with possibly bad performance implications, or
- leave recommender evaluations out of the framework entirely.
I would prefer reworking the evaluation framework, but it's open for
discussion. It also depends on whether the PR will be merged soon.
Theodore, as the author of the evaluation framework, what are your
thoughts on this?
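
To make the mismatch concrete, here is a minimal sketch of precision@k
for a single user (the names and signatures are made up for
illustration, they are not from the PR):

  // Recommended items in rank order vs. the items the user actually
  // interacted with in the test set. Both sizes vary per user, so the
  // inputs are not fixed-size vectors.
  def precisionAtK(recommended: Seq[Int], relevant: Set[Int], k: Int): Double = {
    val topK = recommended.take(k)
    if (topK.isEmpty) 0.0
    else topK.count(relevant.contains).toDouble / topK.size
  }

Recall and nDCG take the same kind of input: a ranked list per user
whose length is not known in advance.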
Second, picking the form of evaluation also affects how we should
represent the ranking predictions. We could choose a flat form (i.e.
DataSet[(Int, Int, Int)]) or represent each user's ranking as an array
(i.e. DataSet[(Int, Array[Int])]). See the details in [4]. The flat
form would allow the system to work in a distributed fashion, so I'd go
with that representation (see the sketch below), but this is also open
for discussion.
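
As a rough illustration of why the flat form composes well with the
DataSet API, a per-user hit count for the top k could be computed with
joins and grouping, without materializing any user's full ranking in
one place (the function and dataset names below are hypothetical):

  import org.apache.flink.api.scala._

  // rankings: (user, item, rank), test: (user, item) held-out pairs
  def hitsAtK(rankings: DataSet[(Int, Int, Int)],
              test: DataSet[(Int, Int)],
              k: Int): DataSet[(Int, Int)] =
    rankings
      .filter(_._3 <= k)                    // keep each user's top k
      .join(test).where(0, 1).equalTo(0, 1) // match recommended vs. held-out
      .map(pair => (pair._1._1, 1))         // one hit per matched item
      .groupBy(0)
      .sum(1)                               // (user, hits in top k)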
Last, ALS and DSGD are two different algorithms for training the same
matrix factorization model, but this is not really visible to the user
in the current API. Training an ALS model modifies the ALS object and
puts a matrix factorization model inside it. We could do the same with
DSGD and introduce a common abstraction (say, a MatrixFactorization
superclass). However, in my opinion, it would be more straightforward
if ALS.fit returned a separate object (say, a MatrixFactorizationModel,
akin to Spark [6]) containing the DataSets representing the factors.
With this approach we could avoid checking at runtime whether a model
has been trained, and force the user at compile time to only call
predict on models that have already been trained.
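
A rough sketch of the shape I have in mind (all names and signatures
here are made up, not a concrete proposal):

  import org.apache.flink.api.scala._

  // A trained model: predict is only available on an object that is
  // guaranteed to hold factors, so no runtime "is it trained?" check.
  class MatrixFactorizationModel(
      val userFactors: DataSet[(Int, Array[Double])],
      val itemFactors: DataSet[(Int, Array[Double])]) {
    def predict(input: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)] = ???
  }

  // ALS and DSGD would both implement this; fit returns the trained
  // model instead of mutating the estimator.
  trait MatrixFactorization {
    def fit(ratings: DataSet[(Int, Int, Double)]): MatrixFactorizationModel
  }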
Of course, this could also be applied to other models in Flink ML, and
it would be an API-breaking change. Was there any reason to pick the
current training API design instead of the more "typesafe" one? I am
certain that we should keep the ML API consistent, so we should either
change the training API of all models or leave them all as they are.
That said, I don't think it would take much effort to modify the API.
We could also keep and deprecate the current fit method to avoid
breaking the API. What do you think? If there are no objections, I'm
happy to open a JIRA and start working on it.
[1] https://github.com/apache/flink/pull/2542
[2] http://dx.doi.org/10.1145/2020408.2020426
[3] https://issues.apache.org/jira/browse/FLINK-4712
[4] https://issues.apache.org/jira/browse/FLINK-4713
[5] https://github.com/apache/flink/pull/1849
[6] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L315
Cheers,
Gabor