This is the most extreme case. A large auto-parts store targeting mainly auto 
mechanics will have data showing different distribution patterns over the years 
than, let's say, a t-shirt store. At this point I'm estimating/speculating based 
on the limited dataset I've acquired so far. If you or anyone else has or knows 
of better statistics (variation, extremes), that would be extremely helpful.
With regard to the use of the in-memory algorithms - I was under the 
impression that those would not work on this model. Is there a rule of thumb 
that connects the model characteristics to the resources needed to run an 
in-memory algorithm? In this case I assume that the 10 million significant 
co-occurrences come from a much larger item-to-item matrix after applying a 
min_support threshold or similar. Is the size of the item-to-item matrix what 
determines the memory requirements for the algorithm? Also, is memory needed 
to process the full item-to-item matrix or only the final one with the 
threshold applied? If I had 1 billion items in the matrix, what would the 
algorithm's memory footprint be? 20GB? Again, if there are best practices 
available for linking the characteristics of a model with an algorithm's 
viability, that would be extremely useful.
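To show where my 20GB guess comes from, here is the back-of-envelope arithmetic 
I'm doing on my end. These are purely my own assumptions (two 4-byte item 
indices plus an 8-byte count per non-zero entry, and a guessed 3x multiplier 
for JVM object/hash-table overhead), not anything I know about Mahout's 
internals:

public class CooccurrenceMemoryEstimate {
  public static void main(String[] args) {
    long nonZeroEntries = 1_000_000_000L;  // the hypothetical 1 billion entries
    long bytesPerEntryRaw = 4 + 4 + 8;     // two int item indices + one 8-byte count
    double jvmOverheadFactor = 3.0;        // guessed multiplier for object/hash-table overhead

    double rawGb = nonZeroEntries * bytesPerEntryRaw / 1e9;
    double withOverheadGb = rawGb * jvmOverheadFactor;

    // prints: raw: ~16 GB, with assumed overhead: ~48 GB
    System.out.printf("raw: ~%.0f GB, with assumed overhead: ~%.0f GB%n",
        rawGb, withOverheadGb);
  }
}

So my 20GB number may well be an underestimate, depending on how the entries 
are actually represented in memory.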
Currently I'm storing the full item-to-item matrix to support future 
incremental updates of the model. Could this somehow be done in Mahout, or is 
a full run required every time?
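For clarity, this is roughly the kind of incremental merge I have in mind - a 
minimal sketch with names I made up (mergeBatch, storedCounts), not anything 
from the Mahout API - where a new batch of transactions only bumps the stored 
pair counts instead of triggering a full recomputation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: merge co-occurrence counts from a new batch of
// transactions into a previously stored item-to-item count map.
public class IncrementalCooccurrenceUpdate {

  // key = smallerItemId + ":" + largerItemId, value = co-occurrence count
  static void mergeBatch(Map<String, Long> storedCounts, List<long[]> newTransactions) {
    for (long[] items : newTransactions) {
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          long a = Math.min(items[i], items[j]);
          long b = Math.max(items[i], items[j]);
          String key = a + ":" + b;
          Long current = storedCounts.get(key);
          storedCounts.put(key, current == null ? 1L : current + 1L);
        }
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Long> stored = new HashMap<>();
    mergeBatch(stored, List.of(new long[] {1, 2, 3}, new long[] {2, 3}));
    // counts: 1:2=1, 1:3=1, 2:3=2 (iteration order not guaranteed)
    System.out.println(stored);
  }
}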
Thanks for your time.
-qf
--- On Tue, 5/4/10, Ted Dunning <ted.dunn...@gmail.com> wrote:

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Tuesday, May 4, 2010, 5:27 PM

This is much denser than I would expect.  You are saying that you would have
an average of 1000 transactions per user.  It is more normal to have 100 or
less.  If you have these smaller sizes, then in-memory algorithms on a
single (large) machine begin to be practical.

On Tue, May 4, 2010 at 2:01 PM, Sean Owen <sro...@gmail.com> wrote:

> On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> > Purely based on estimates, assuming 5 billion transactions, 5 million
> users, 100K products normally distributed are expected to create a sparse
> item to item matrix of up to 10 Million significant co-occurrences
> (significance is not globally defined but in the context of the active item
> to recommend from; in other words support can be really tiny, confidence
> less so).
>

