Yes, what Myrrix does is good.

My last aside was a wish for an online item-based recommender, not just a 
factorized one. Ted talks about using Solr for this, which we're experimenting 
with alongside Myrrix. I suspect Solr works, but it requires a bit of tinkering 
and doesn't have quite the same set of options--no LLR similarity, for instance.
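
(For readers following along, LLR is the log-likelihood ratio similarity. A 
minimal sketch of the usual 2x2 contingency-table computation--illustrative 
only, class and method names made up, not necessarily Mahout's exact code:)

    // Illustrative LLR similarity for two items from co-occurrence counts:
    // k11 = users who interacted with both items, k12 = first item only,
    // k21 = second item only, k22 = neither.
    final class Llr {
      static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }
      static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
          result += xLogX(c);
          sum += c;
        }
        return xLogX(sum) - result;
      }
      static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
      }
    }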

On the same subject: I recently attended a workshop in Seattle for UAI 2013 
where Walmart reported similar results using a factorized recommender. They had 
to keep increasing the number of factors past the point where the model would 
perform well, and along the way they saw offline precision keep improving. They 
eventually gave up on a factorized solution. That decision seems odd, but 
anyway… Both Walmart's catalog and our data set are quite diverse. The best 
approach is probably to create separate recommenders for different parts of the 
catalog, but if you build one model on all items, our intuition is that 
item-based works better than factorized. Again the caveat: no A/B tests to 
support this yet.

Doing an online item-based recommender would quickly run into scaling problems, 
no? We put together the simple Mahout in-memory version, and it could not 
really handle more than a down-sampled few months of our data. Down-sampling 
cost us 20% of our precision scores, so we moved to the Hadoop version. Now we 
have use cases for an online recommender that handles anonymous new users, and 
that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <s...@apache.org> wrote:

Hi Pat

I think we should provide simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For item-based recommenders it's
straightforward to compute recommendations; for user-based ones you have to
search through all users once; for latent factor models you have to fold
the user vector into the low-dimensional space.
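
A minimal sketch of what the signature could look like (the interface name
is hypothetical; PreferenceArray, RecommendedItem and TasteException are the
existing Taste types):

    import java.util.List;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    // The anonymous user's history comes in as a PreferenceArray; an
    // item-based implementation scores candidate items against it using
    // the precomputed item-item similarities, no stored user required.
    public interface AnonymousRecommender {
      List<RecommendedItem> recommendToAnonymous(PreferenceArray anonymousPrefs,
                                                 int howMany) throws TasteException;
    }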

I think Sean already added this method in Myrrix, and I have some code for
my kornakapi project (a simple web layer for Mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel <pat.fer...@gmail.com>

> May I ask how you plan to support model updates and 'anonymous' users?
> 
> I assume the latent factor model is still calculated offline in batch
> mode, and then there are periodic updates? How are the updates handled? Do
> you plan to require batch refactorization of the model for every update? Or
> will you perform some partial update, perhaps by just transforming new data
> into the LF space already in place, and then do a full refactorization every
> so often in batch mode?
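>
> (For illustration, a minimal sketch of the simple projection fold-in I have
> in mind--names made up; a regularized least-squares solve against the item
> factors would be the more accurate variant:)
>
>     // Sketch: fold a new preference vector into an existing latent
>     // factor space by taking the preference-weighted sum of the factor
>     // vectors of the items in the history. itemFactors comes from the
>     // batch job, shape numItems x numFeatures.
>     final class FoldIn {
>       static double[] foldIn(double[][] itemFactors,
>                              int[] itemIndexes, float[] values) {
>         int numFeatures = itemFactors[0].length;
>         double[] userFactors = new double[numFeatures];
>         for (int i = 0; i < itemIndexes.length; i++) {
>           double[] itemVector = itemFactors[itemIndexes[i]];
>           for (int f = 0; f < numFeatures; f++) {
>             userFactors[f] += values[i] * itemVector[f];
>           }
>         }
>         return userFactors; // then score dot(userFactors, itemFactors[j])
>       }
>     }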
> 
> By 'anonymous users' I mean users with some history that is not yet
> incorporated into the LF model. This could be history from a new user asked
> to pick a few items to start the rec process, or from an old user with some
> new action history not yet in the model. Are you going to allow passing
> either the entire history vector or a userID plus incremental new history to
> the recommender? I hope so.
> 
> For what it's worth, we compared Mahout's item-based CF to Mahout's ALS-WR
> CF on 2.5M users and 500K items with many millions of actions over six
> months of data. The data was purchase data from a diverse ecommerce source
> with a large variety of products, from electronics to clothes. We found that
> item-based CF did far better than ALS. As we increased the number of latent
> factors the results got better, but they never came within 10% of item-based
> (we used MAP as the offline metric). Not sure why, but maybe it has to do
> with the diversity of the item types.
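>
> (MAP here is mean average precision over users against held-out purchases.
> A minimal sketch of the per-user average precision, illustrative only:)
>
>     import java.util.Set;
>
>     // Sketch: average precision for one user. Precision@k is accumulated
>     // at each rank k where a held-out item appears, then normalized by
>     // the number of attainable hits; MAP is the mean over test users.
>     final class AvgPrecision {
>       static double averagePrecision(long[] ranked, Set<Long> heldOut) {
>         double sum = 0.0;
>         int hits = 0;
>         for (int k = 0; k < ranked.length; k++) {
>           if (heldOut.contains(ranked[k])) {
>             hits++;
>             sum += (double) hits / (k + 1);
>           }
>         }
>         int attainable = Math.min(heldOut.size(), ranked.length);
>         return attainable == 0 ? 0.0 : sum / attainable;
>       }
>     }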
> 
> I understand that a full online item-based recommender has very different
> tradeoffs, and in any case others may not have seen this disparity in
> results. Furthermore, we don't have A/B test results yet to validate the
> offline metric.
> 
> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> 
> Peng,
> 
> This is the reason I separated out the DataModel and only put the learner
> stuff there. The learner I mentioned yesterday just stores the parameters,
> (noOfUsers + noOfItems) * noOfLatentFactors of them, and does not care where
> the preferences are stored.
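>
> (Roughly this shape, as a sketch--class and field names made up:)
>
>     // Learner state holding only the model parameters:
>     // (noOfUsers + noOfItems) * noOfLatentFactors values, independent
>     // of wherever the preferences themselves live.
>     final class FactorizationState {
>       final double[][] userFactors;  // noOfUsers x noOfLatentFactors
>       final double[][] itemFactors;  // noOfItems x noOfLatentFactors
>       FactorizationState(int noOfUsers, int noOfItems, int noOfLatentFactors) {
>         userFactors = new double[noOfUsers][noOfLatentFactors];
>         itemFactors = new double[noOfItems][noOfLatentFactors];
>       }
>     }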
> 
> I, kind of, agree with the multi-level DataModel approach: one for iterating
> over "all" preferences, and one for when you want to deploy a recommender
> and perform a lot of top-N recommendation tasks.
> 
> (Or one DataModel with a strategy that might reduce the existing memory
> consumption while still providing fast access; I am not sure. Let me try a
> matrix-backed DataModel approach.)
> 
> Gokhan
> 
> 
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <s...@apache.org> wrote:
> 
>> I completely agree. Netflix is less than one gigabyte in a smart
>> representation, so 12x more memory is a no-go. The techniques used in
>> FactorizablePreferences allow a much more memory-efficient representation;
>> tested on the KDD Music dataset, which is approximately 2.5 times the size
>> of Netflix, it fits into 3GB with that approach.
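>>
>> (As a sketch of what a compact representation can look like--not the
>> actual FactorizablePreferences code--parallel primitive arrays in
>> user-major order avoid per-entry objects and hash-map overhead:)
>>
>>     // CSR-style layout: user u's preferences occupy positions
>>     // userOffsets[u] .. userOffsets[u + 1] - 1 of itemIds/values,
>>     // costing ~8 bytes per preference plus 4 bytes per user.
>>     final class CompactPreferences {
>>       final int[] userOffsets;  // length numUsers + 1, row pointers
>>       final int[] itemIds;      // item index per preference
>>       final float[] values;     // rating/weight per preference
>>       CompactPreferences(int[] userOffsets, int[] itemIds, float[] values) {
>>         this.userOffsets = userOffsets;
>>         this.itemIds = itemIds;
>>         this.values = values;
>>       }
>>     }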
>> 
>> 
>> 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
>> 
>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>> 
>>> Note also that this is before you have done any work.
>>> 
>>> Ideally, 100million observations should take << 1GB.
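>>>
>>> (Back-of-envelope: 100 million observations stored naively as a 4-byte
>>> user id + 4-byte item id + 4-byte float is ~1.2GB; dropping the
>>> per-observation user id via a CSR layout gets ~800MB, and quantizing
>>> values to a byte or delta/varint-encoding sorted item ids brings it
>>> well under 1GB.)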
>>> 
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>> 
>>>> The second idea is indeed splendid: we should separate the
>>>> time-complexity-first and space-complexity-first implementations. What
>>>> I'm not quite sure about is whether we really need to create two
>>>> interfaces instead of one. Personally, I think a 12GB heap is not that
>>>> high, right? Most new laptops can already handle that (emphasis on
>>>> laptop). And if we replaced the hash map (the culprit of the high memory
>>>> consumption) with a list/linked list, lookups would simply degrade to an
>>>> O(n) linear search, which is not too bad either. The current DataModel
>>>> is the result of careful thought and has undergone extensive testing;
>>>> it is easier to expand on top of it than to subvert it.
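>>>>
>>>> (For scale: each mapping in a java.util.HashMap can easily cost 50+
>>>> bytes on a 64-bit JVM--entry object, boxed key and value, table
>>>> slack--versus about 12 bytes for a packed int/int/float triple, which
>>>> is roughly where a 12GB-versus-1GB gap comes from.)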
>>> 
>> 
> 
> 
