You are completely right: the simple interface would only be usable for read-only / batch-updatable recommenders, while online recommenders might need something different. I tried to widen the discussion here to cover all the kinds of API changes to the recommenders that will be necessary in the future.
2013/7/17 Peng Cheng <pc...@uowmail.edu.au>

> One thing that suddenly comes to my mind is that, for a simple interface
> like FactorizablePreferences, sequential READ in real time may be
> possible, but sequential WRITE in O(1) time is utopia: you need to flush
> out the old preference with the same user and item ID (in the worst case
> this could be an interpolation search), otherwise you are permitting a
> user to rate an item twice with different values. Considering how
> FileDataModel is supposed to work (new files flush old files), maybe
> using the simple interface has fewer advantages than we used to believe.
>
> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>
>> Hi Peng,
>>
>> I never wanted to discard the old interface, I just wanted to split it
>> up. I want to have a simple interface that only supports sequential
>> access (and allows for very memory-efficient implementations, e.g. by
>> the use of primitive arrays). DataModel should *extend* this interface
>> and provide sequential and random access (basically what it already
>> does).
>>
>> Then a recommender such as SGD could state that it only needs
>> sequential access to the preferences, and you can either feed it a
>> DataModel (so we don't break backwards compatibility) or a
>> memory-efficient sequential-access thingy.
>>
>> Does that make sense to you?
>>
>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>
>>> I see, OK, so we shouldn't use the old implementation. But I mean,
>>> the old interface doesn't have to be discarded. The discrepancy
>>> between your FactorizablePreferences and DataModel is that your model
>>> supports getPreferences(), which returns all preferences as an
>>> iterator, while DataModel supports a few old functions that return
>>> the preferences of an individual user or item.
>>>
>>> My point is that it is not hard for each of them to implement what
>>> the other lacks: the old DataModel can implement getPreferences()
>>> just by a loop in an abstract class.
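The interface split Sebastian describes can be sketched roughly as below. Only the names FactorizablePreferences and DataModel appear in the thread; every method signature and the ArrayPreferences class are illustrative assumptions, not Mahout's actual API.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sequential access only: enough for batch learners such as SGD. */
interface FactorizablePreferences {
    /** Iterate over (userIndex, itemIndex, rating) triples in storage order. */
    Iterator<float[]> getPreferences();
    int numPreferences();
}

/** Full model: extends the sequential view with random access (sketch). */
interface DataModel extends FactorizablePreferences {
    float getPreferenceValue(int userIndex, int itemIndex);
}

/** Memory-efficient sequential implementation backed by primitive arrays. */
class ArrayPreferences implements FactorizablePreferences {
    private final int[] users;
    private final int[] items;
    private final float[] ratings;

    ArrayPreferences(int[] users, int[] items, float[] ratings) {
        this.users = users;
        this.items = items;
        this.ratings = ratings;
    }

    public int numPreferences() {
        return ratings.length;
    }

    public Iterator<float[]> getPreferences() {
        return new Iterator<float[]>() {
            private int pos = 0;
            public boolean hasNext() {
                return pos < ratings.length;
            }
            public float[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                float[] triple = {users[pos], items[pos], ratings[pos]};
                pos++;
                return triple;
            }
        };
    }
}
```

An SGD recommender would then declare a dependency on FactorizablePreferences only, and any DataModel would still satisfy it, which is exactly the backwards-compatibility argument made above.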
>>> Your new FactorizablePreferences can implement those old functions by
>>> a binary search that takes O(log n) time, or an interpolation search
>>> that takes O(log log n) time on average. The same goes for the online
>>> update. It would just be a matter of different speed and space
>>> trade-offs, not a different interface standard, so we could keep the
>>> old unit tests, old examples, old everything. And we would be more
>>> flexible in writing ensemble recommenders.
>>>
>>> Just a few thoughts, I'll have to validate the idea first before
>>> creating a new JIRA ticket.
>>>
>>> Yours, Peng
>>>
>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>
>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>> representation; 12x more memory is a no-go. The techniques used in
>>>> FactorizablePreferences allow a much more memory-efficient
>>>> representation, tested on the KDD Music dataset, which is approx.
>>>> 2.5 times Netflix and fits into 3GB with that approach.
>>>>
>>>> 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
>>>>
>>>>> Netflix is a small dataset. 12GB for that seems quite excessive.
>>>>> Note also that this is before you have done any work.
>>>>>
>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>> wrote:
>>>>>
>>>>>> The second idea is indeed splendid; we should separate the
>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>> What I'm not quite sure about is whether we really need to create
>>>>>> two interfaces instead of one. Personally, I think 12GB of heap
>>>>>> space is not that high, right? Most new laptops can already handle
>>>>>> that (emphasis on laptop). And if we replace the hash map (the
>>>>>> culprit of the high memory consumption) with a list/linked list,
>>>>>> it would simply degrade the time complexity of a lookup to a
>>>>>> linear search in O(n), not too bad either.
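The O(log n) random access on top of a sequential layout could look like the following sketch: parallel primitive arrays sorted by a packed (userID, itemID) key, queried with java.util.Arrays.binarySearch. The class and method names are assumptions for illustration; note that an in-place update of an existing rating is cheap, while inserting a brand-new (user, item) pair would require shifting the arrays, which is the O(1)-write problem raised above.

```java
import java.util.Arrays;

class SortedPreferences {
    // Parallel arrays sorted by the packed key (userID << 32) | itemID.
    private final long[] keys;
    private final float[] ratings;

    SortedPreferences(long[] sortedKeys, float[] ratings) {
        this.keys = sortedKeys;
        this.ratings = ratings;
    }

    static long key(int userID, int itemID) {
        return ((long) userID << 32) | (itemID & 0xFFFFFFFFL);
    }

    /** Random access in O(log n); returns NaN when the pair is unrated. */
    float getPreferenceValue(int userID, int itemID) {
        int pos = Arrays.binarySearch(keys, key(userID, itemID));
        return pos >= 0 ? ratings[pos] : Float.NaN;
    }

    /**
     * Overwrite an existing rating in place, so a user can never hold two
     * different values for the same item. Returns false for a new pair,
     * since a true insert would need an O(n) shift of both arrays.
     */
    boolean setPreferenceValue(int userID, int itemID, float value) {
        int pos = Arrays.binarySearch(keys, key(userID, itemID));
        if (pos < 0) {
            return false;
        }
        ratings[pos] = value;
        return true;
    }
}
```

An interpolation search over the same sorted key array would bring the expected lookup down to O(log log n) when the packed keys are roughly uniformly distributed, as suggested above.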
>>>>>> The current DataModel is the result of careful thought and has
>>>>>> undergone extensive testing; it is easier to expand on top of it
>>>>>> than to subvert it.
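The memory figures traded in this thread (12GB heap vs. "<< 1GB") can be sanity-checked with back-of-envelope arithmetic. The per-entry overhead constants below are rough assumptions about a 64-bit JVM, not measurements, but they show why primitive arrays land near Ted's target while a boxed hash map lands near the 12GB complaint.

```java
class MemoryEstimate {

    /** int userID + int itemID + float rating = 12 bytes per observation. */
    static long primitiveArraysBytes(long n) {
        return n * (4 + 4 + 4);
    }

    /**
     * HashMap.Entry + boxed Long key + boxed Float value: assumed to
     * average roughly 48 + 24 + 16 bytes per entry with headers,
     * alignment, and the bucket table included.
     */
    static long boxedHashMapBytes(long n) {
        return n * (48 + 24 + 16);
    }

    public static void main(String[] args) {
        long n = 100_000_000L;  // roughly Netflix-sized
        System.out.println("primitive arrays: "
            + primitiveArraysBytes(n) / (1L << 30) + " GiB");
        System.out.println("boxed hash map:   "
            + boxedHashMapBytes(n) / (1L << 30) + " GiB");
    }
}
```

Even the 12-byte-per-observation layout is not the floor: delta-encoding the sorted user IDs or using smaller item ID widths is presumably how representations like the one tested on the KDD Music dataset get well below it.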