This is the most extreme case. A large auto-parts store targeting mainly auto mechanics will show different distribution patterns over the years than, let's say, a t-shirt store. At this point I'm estimating/speculating based on the limited dataset I've acquired so far. If you or anyone else has or knows of better statistics (variation, extremes), that would be extremely helpful.

With regard to the in-memory algorithms: I was under the impression that those would not work on this model. Is there a rule of thumb that connects the characteristics of a model to the resources needed to run an in-memory algorithm? In this case I assume the 10 million significant co-occurrences come from a much larger item-to-item matrix after applying a min_support threshold or similar. Does the size of this item-to-item matrix determine the memory requirements of the algorithm? Also, is memory needed to process the full item-to-item matrix, or only the final one with the threshold applied? If I had 1 billion entries in the matrix, what would the algorithm's memory footprint be? 20 GB? Again, if there are best practices linking the characteristics of a model to the viability of an algorithm, that would be extremely useful.

Currently I'm storing the full item-to-item matrix to support a future incremental update of the model. Could this somehow be done in Mahout, or is a full run required every time? (Rough sketches of the memory estimate and of the incremental fold-in I have in mind are appended below the quoted message.)

Thanks for your time.
-qf

--- On Tue, 5/4/10, Ted Dunning <ted.dunn...@gmail.com> wrote:
From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Tuesday, May 4, 2010, 5:27 PM

This is much denser than I would expect. You are saying that you would have an average of 1000 transactions per user. It is more normal to have 100 or less. If you have these smaller sizes, then in-memory algorithms on a single (large) machine begin to be practical.

On Tue, May 4, 2010 at 2:01 PM, Sean Owen <sro...@gmail.com> wrote:
> On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> > Purely based on estimates, assuming 5 billion transactions, 5 million
> users, 100K products normally distributed are expected to create a sparse
> item to item matrix of up to 10 Million significant co-occurrences
> (significance is not globally defined but in the context of the active item
> to recommend from; in other words support can be really tiny, confidence
> less so).
>
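
To make the memory question above concrete, here is a rough back-of-envelope sketch of how the number of non-zero entries in a sparse item-to-item matrix might translate into an in-memory footprint. The per-entry size (16 bytes for a packed row/column/count triple) and the overhead factor are assumptions for illustration, not measurements of any particular Mahout data structure.

/**
 * Rough back-of-envelope estimate of the in-memory footprint of a sparse
 * item-to-item co-occurrence matrix. The per-entry size and the JVM
 * overhead factor are assumptions, not measurements.
 */
public class CooccurrenceMemoryEstimate {

  private static final long RAW_BYTES_PER_ENTRY = 4 + 4 + 8;   // int row, int col, double count
  private static final double ASSUMED_OVERHEAD_FACTOR = 3.0;   // rough guess at collection/object overhead

  static double estimateGigabytes(long nonZeroEntries) {
    double bytes = nonZeroEntries * RAW_BYTES_PER_ENTRY * ASSUMED_OVERHEAD_FACTOR;
    return bytes / (1024.0 * 1024.0 * 1024.0);
  }

  public static void main(String[] args) {
    // ~10 million significant co-occurrences (the thresholded matrix)
    System.out.printf("10M entries: ~%.1f GB%n", estimateGigabytes(10000000L));
    // ~1 billion entries (the full, un-thresholded matrix)
    System.out.printf("1B entries : ~%.1f GB%n", estimateGigabytes(1000000000L));
  }
}

Under these assumptions, the 10 million thresholded co-occurrences fit comfortably in under 1 GB, while a full matrix of 1 billion entries needs roughly 15 GB tightly packed and several times that with typical Java collection overhead, so a 20 GB guess looks plausible only for a very compact representation.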
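
Similarly, here is a minimal sketch of the kind of incremental fold-in I have in mind when I say I keep the full item-to-item matrix: add the co-occurrence counts from a batch of new transactions to the stored counts, then re-apply the support threshold. This is plain Java for illustration only, not Mahout's API; the string pair key and the global min_support rule are simplifying assumptions (as noted earlier, my actual significance test is relative to the active item).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch (not Mahout API) of an incremental update to a stored
 * item-to-item count matrix: fold co-occurrence counts from new
 * transactions into the existing counts, then re-apply the threshold.
 */
public class IncrementalCooccurrence {

  // Full (un-thresholded) counts, keyed by a canonical "itemA|itemB" pair.
  private final Map<String, Long> counts = new HashMap<String, Long>();

  /** Fold in the co-occurrences from one new transaction (a basket of item ids). */
  public void addTransaction(List<String> items) {
    for (int i = 0; i < items.size(); i++) {
      for (int j = i + 1; j < items.size(); j++) {
        String a = items.get(i);
        String b = items.get(j);
        String key = a.compareTo(b) <= 0 ? a + '|' + b : b + '|' + a;
        Long old = counts.get(key);
        counts.put(key, old == null ? 1L : old + 1L);
      }
    }
  }

  /** Keep only the pairs at or above the chosen support threshold. */
  public Map<String, Long> thresholded(long minSupport) {
    Map<String, Long> significant = new HashMap<String, Long>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        significant.put(e.getKey(), e.getValue());
      }
    }
    return significant;
  }
}

The question, then, is whether Mahout can consume pre-aggregated counts like these, or whether the co-occurrence counting has to be re-run from scratch each time.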