Hi Gokhan, I like your proposals and I think this is an important discussion. Peng is also interested in working on online recommenders, so we should try to combine our efforts. I'd also like to extend the discussion a little to related API changes that I think are necessary.
What do you think about completely removing the setPreference() and removePreference() methods from Recommender? I don't think they belong there, for two reasons: first, they duplicate functionality from DataModel; second, many recommenders are read-only/train-once and cannot handle single preference updates anyway. Instead, we should have a DataModel implementation that can be updated, and an online learning recommender should be able to register to be notified of updates.

Furthermore, we should split the DataModel interface into a hierarchy of three parts:

1. A simple read-only interface that allows sequential access to the data (similar to FactorizablePreferences). This lets us create memory-efficient implementations. E.g., Cheng reported in MAHOUT-1272 that the current DataModel needs 12GB of heap for the Netflix dataset (100M ratings), which is unacceptable, while I was able to fit the KDD Music dataset (250M ratings) into 3GB with FactorizablePreferences.

2. A second interface that extends the read-only one and resembles what DataModel is today: an easy-to-use in-memory implementation that trades high memory consumption for convenient random access.

3. A third interface that extends the second and provides tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian

> The DataModel I imagine would follow the current API, where underlying
> preference storage is replaced with a matrix.
>
> A Recommender would then use the DataModel and the OnlineLearner, where
> Recommender#setPreference is delegated to DataModel#setPreference (like it
> does now), and DataModel#setPreference triggers OnlineLearner#train.
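To make the proposed split more concrete, here is a rough sketch of the three-level hierarchy plus the listener registration I have in mind. All names (SequentialPreferences, UpdatablePreferences, PreferenceChangeListener, etc.) are placeholders I made up for illustration, not existing Mahout API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Level 1: read-only sequential access, in the spirit of FactorizablePreferences. */
interface SequentialPreferences {
  long numPreferences();
}

/** Level 2: adds convenient random access, resembling today's DataModel. */
interface RandomAccessPreferences extends SequentialPreferences {
  Float getPreferenceValue(long userID, long itemID);
}

/** Callback an online learning recommender would register to be notified of updates. */
interface PreferenceChangeListener {
  void onSetPreference(long userID, long itemID, float value);
}

/** Level 3: adds online updates; setPreference() lives here, not on Recommender. */
interface UpdatablePreferences extends RandomAccessPreferences {
  void setPreference(long userID, long itemID, float value);
  void addChangeListener(PreferenceChangeListener listener);
}

/** Toy in-memory implementation, just to show the notification flow. */
class InMemoryPreferences implements UpdatablePreferences {
  private final Map<String, Float> prefs = new HashMap<>();
  private final List<PreferenceChangeListener> listeners = new ArrayList<>();

  public long numPreferences() {
    return prefs.size();
  }

  public Float getPreferenceValue(long userID, long itemID) {
    return prefs.get(userID + ":" + itemID);
  }

  public void setPreference(long userID, long itemID, float value) {
    prefs.put(userID + ":" + itemID, value);
    // notify registered listeners, e.g. so an OnlineLearner can retrain incrementally
    for (PreferenceChangeListener l : listeners) {
      l.onSetPreference(userID, itemID, value);
    }
  }

  public void addChangeListener(PreferenceChangeListener listener) {
    listeners.add(listener);
  }
}
```

With this split, a train-once recommender depends only on the first or second interface, and only truly online recommenders ever see updates.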
