Deniz and I have already exchanged a few messages. If I might paraphrase the result --
- Some of this I completely agree with. The memory usage is too high. This is what the big MAHOUT-151 and MAHOUT-154 patches are about.

- The big take-away from this is that the collaborative filtering code is not really delivering on what Mahout advertises: Mahout says it is very distributed-processing-centric, and the collaborative filtering code, at heart and by design, is not. So when you ask whether a distributed system scales better than a non-distributed system at a scale where one must distribute, the result is not surprising. The test did attempt to use the one algorithm that is half-distributed.

- The CF code is missing a truly distributable recommendation algorithm, and that is a gap, though it has a partially distributed algorithm and a sort of pseudo-distributed mode for all algorithms.

- I don't believe a distributed computation is appropriate in all or even most CF settings. It is too much overhead for a small organization or a small problem, and does not suit contexts where real-time recommendations are required. But, of course, there are some situations where a big distributed computation is the only option.

- The CF code is just that -- code for general collaborative filtering, and not other things. It does not advertise itself as specialized for a domain or as a tool for related tasks, so I do not think this is a failing of the library.

- This was a test of one algorithm (by necessity, see above) -- slope-one -- and based on some junky example code I wrote a long time ago for Netflix (again, by necessity). In slope-one's defense, it is not clear that it is the most appropriate algorithm for the data set tested (Netflix), and my implementation does not include the variant preferred by Daniel Lemire (its creator), called bi-polar slope-one. In any event, this result amounts to an evaluation of one algorithm only, not a statement about the framework, which I don't think it was meant to be; the point of the framework is that it provides several algorithms and components for making more. (A minimal sketch of what plain slope-one does follows this list.)

- There is a tradeoff between designing for generality and designing for performance in a particular domain. If it made more assumptions, the framework could be faster or more accurate for a particular domain. I take this as some indication that the framework remains too far on the side of generality, and could stand to make stronger assumptions and restrictions on its input and problem domain. (See my note on switching to longs only for IDs.)
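For anyone unfamiliar with slope-one: it predicts a user's rating for item i by averaging, over the items j the user has already rated, the user's rating for j plus the average observed difference between ratings of i and j, weighting each term by how many users rated both items. Here is a minimal sketch of that plain weighted variant (not bi-polar slope-one, and not Mahout's actual API -- the class and method names here are mine, for illustration only):

    // SlopeOneSketch.java -- hypothetical illustration, not Mahout code.
    import java.util.HashMap;
    import java.util.Map;

    public class SlopeOneSketch {

      // diffSums.get(i).get(j) = running sum of (rating(i) - rating(j))
      // counts.get(i).get(j)   = number of users who rated both i and j
      // A real implementation would use primitive-long-keyed maps to cut
      // memory, per the IDs-as-longs point above; boxed Longs are wasteful.
      private final Map<Long, Map<Long, Double>> diffSums = new HashMap<>();
      private final Map<Long, Map<Long, Integer>> counts = new HashMap<>();

      /** Train on one user's ratings: item ID -> rating. */
      public void addUser(Map<Long, Double> ratings) {
        for (Map.Entry<Long, Double> e1 : ratings.entrySet()) {
          for (Map.Entry<Long, Double> e2 : ratings.entrySet()) {
            long i = e1.getKey(), j = e2.getKey();
            if (i == j) continue;
            diffSums.computeIfAbsent(i, k -> new HashMap<>())
                    .merge(j, e1.getValue() - e2.getValue(), Double::sum);
            counts.computeIfAbsent(i, k -> new HashMap<>())
                  .merge(j, 1, Integer::sum);
          }
        }
      }

      /** Estimate the rating for item i, given the user's existing ratings. */
      public double estimate(long i, Map<Long, Double> userRatings) {
        Map<Long, Double> sums = diffSums.get(i);
        Map<Long, Integer> cnts = counts.get(i);
        if (sums == null || cnts == null) return Double.NaN; // nothing known about i
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<Long, Double> e : userRatings.entrySet()) {
          Integer c = cnts.get(e.getKey());
          if (c == null) continue;                    // i and j never co-rated
          double avgDiff = sums.get(e.getKey()) / c;  // avg of rating(i) - rating(j)
          numerator += (e.getValue() + avgDiff) * c;  // weight by co-rating count
          denominator += c;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
      }
    }

The appeal is that the diff matrix can be built once and updated incrementally, so estimation is cheap; the catch, and the memory point above, is that the matrix is O(items^2) in the worst case.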
On Mon, Aug 3, 2009 at 6:11 PM, Otis Gospodnetic <[email protected]> wrote:
> Hi Deniz,
>
> Maybe you could get help with optimal Taste configuration and usage from this
> list and then redo the review/eval?
>
> It's the same logic I described here:
> http://www.jroller.com/otis/entry/open_source_search_engine_benchmark
>
> (note the bolded part)
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>