I'm copying below my initial comments on the white paper. It highlighted several real gaps in the framework, a number of which have since been addressed: memory usage and performance are broadly 3-4x better, and distributed algorithms are now available.
I'm not sure I found the comparison very valid. It compares distributing a general-purpose, non-distributed, real-time recommender (and some experimental code at that) against a purpose-built, distributed, offline system. The author doesn't seem to have attempted to tune Mahout, or to have asked about it. Then again, they're selling their solution, so who can blame them for picking the use case best suited to it? I'm sure you could turn it around and benchmark the two in a system that needs real-time recommendations, for instance, and find the other system doesn't work at all. But that wouldn't prove much either, so I wouldn't write a paper on it. There's simply no right answer for recommendations; it depends on the data and its scale. An approach that works well in one context might be worthless in another. You'd have to try solutions on your own data and infrastructure to really know what's best.

Deniz and I have already exchanged a few messages. If I might paraphrase the result:

- Some of this I completely agree with. The memory usage is too high; this is what the big MAHOUT-151 and MAHOUT-154 patches are about.

- The big take-away is that the collaborative filtering code is not really delivering on what Mahout advertises: Mahout says it is very distributed-processing-centric, and the collaborative filtering code, at heart, is not by design. So when you ask whether a distributed system scales better than a non-distributed one at a scale where you must distribute, the result is not surprising. The test did attempt to use the one algorithm that is half-distributed.

- The CF code is missing a truly distributable recommendation algorithm, and that is a gap, though it has a partially distributed algorithm and a sort of pseudo-distributed mode for all algorithms.

- I don't believe a distributed computation is appropriate in all, or even most, CF settings. It is too much overhead for a small organization or a small problem, and it does not suit contexts where real-time recommendations are required. But of course there are some situations where a big distributed computation is the only option.

- The CF code is just that: code for general collaborative filtering, and not other things. It does not advertise itself as specialized for a domain or as a tool for related tasks, so I don't think this is a failing of the library.

- This was a test of one algorithm (by necessity, see above), slope-one, based on some junky example code I wrote a long time ago for Netflix (again, by necessity). In slope-one's defense, it is not clear that it is the most appropriate algorithm for the data set tested (Netflix), and my implementation does not include the variant preferred by Daniel Lemire (its creator), called bi-polar slope-one. In any event, this result amounts to an evaluation of one algorithm only, and not really a statement about the framework, which I don't think it was meant to be; the point of the framework is that it provides several algorithms and components for building more. (For anyone unfamiliar with slope-one, a rough sketch of the basic prediction follows below.)

- There is a tradeoff between designing for generality and designing for performance or for a particular domain. If it made more assumptions, the framework could be faster or more accurate for a particular domain. I take this as some indication that the framework remains too far on the side of generality, and could stand to make stronger assumptions and restrictions on its input and problem domain. (See my note on switching to longs only for IDs; a sketch of why that matters for memory also follows below.)
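For anyone unfamiliar with slope-one, here's a rough, illustrative sketch of the basic unweighted prediction in plain Java -- not Mahout's implementation, and not the bi-polar variant, just the core idea with made-up in-memory maps and IDs. You precompute the average rating difference between each pair of co-rated items, then predict a user's rating for an item by adding those differences to the ratings the user already has and averaging:

import java.util.HashMap;
import java.util.Map;

/** Illustrative, unweighted slope-one -- not Mahout's code, just the core idea. */
public class SlopeOneSketch {

  // userID -> (itemID -> rating)
  private final Map<Long, Map<Long, Double>> data;
  // itemJ -> (itemI -> {sum of (ratingJ - ratingI), count of co-raters})
  private final Map<Long, Map<Long, double[]>> diffs = new HashMap<>();

  SlopeOneSketch(Map<Long, Map<Long, Double>> data) {
    this.data = data;
    // Precompute rating differences between every pair of co-rated items.
    for (Map<Long, Double> ratings : data.values()) {
      for (Map.Entry<Long, Double> j : ratings.entrySet()) {
        for (Map.Entry<Long, Double> i : ratings.entrySet()) {
          if (j.getKey().equals(i.getKey())) {
            continue;
          }
          double[] d = diffs.computeIfAbsent(j.getKey(), k -> new HashMap<>())
                            .computeIfAbsent(i.getKey(), k -> new double[2]);
          d[0] += j.getValue() - i.getValue(); // running sum of differences
          d[1]++;                              // number of users rating both
        }
      }
    }
  }

  /** Predict a rating for itemJ by applying the average diffs to the user's ratings. */
  double estimate(long userID, long itemJ) {
    Map<Long, Double> ratings = data.get(userID);
    Map<Long, double[]> diffsForJ = diffs.get(itemJ);
    if (ratings == null || diffsForJ == null) {
      return Double.NaN;
    }
    double total = 0.0;
    int count = 0;
    for (Map.Entry<Long, Double> e : ratings.entrySet()) {
      double[] d = diffsForJ.get(e.getKey());
      if (d == null || d[1] == 0.0) {
        continue;
      }
      total += e.getValue() + d[0] / d[1]; // ratingI + average(ratingJ - ratingI)
      count++;
    }
    return count == 0 ? Double.NaN : total / count;
  }

  public static void main(String[] args) {
    Map<Long, Map<Long, Double>> data = new HashMap<>();
    data.put(1L, Map.of(10L, 5.0, 20L, 3.0, 30L, 2.0));
    data.put(2L, Map.of(10L, 3.0, 20L, 4.0));
    data.put(3L, Map.of(20L, 2.0, 30L, 5.0));
    SlopeOneSketch sketch = new SlopeOneSketch(data);
    System.out.println(sketch.estimate(2L, 30L)); // predicted rating of item 30 for user 2
  }
}

The weighted form scales each term by the number of co-raters, and bi-polar slope-one computes separate deviations over items the user liked and disliked, but the overall structure is the same.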
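On the longs-only-IDs point, the memory argument is roughly this: if IDs are restricted to primitive longs, a preference can live in two flat array slots, while a generic Map<Long,Float> pays for an entry object plus boxed key and value objects per preference. Here's a deliberately minimal sketch of that idea (fixed capacity, no resizing or deletion) -- again not Mahout's code, just an illustration of why the restriction buys so much:

import java.util.Arrays;

/**
 * Illustrative only: a fixed-capacity, linear-probing map from primitive long IDs
 * to float preference values. No resizing, no removal -- the point is the memory
 * layout: two flat arrays instead of an entry object plus boxed key and value per
 * preference.
 */
public final class LongFloatMapSketch {

  private static final long EMPTY = Long.MIN_VALUE; // reserved sentinel for empty slots

  private final long[] keys;
  private final float[] values;
  private int size;

  public LongFloatMapSketch(int capacity) {
    keys = new long[capacity];
    values = new float[capacity];
    Arrays.fill(keys, EMPTY);
  }

  // Hash the long key, then probe linearly until we hit the key or an empty slot.
  private int indexOf(long key) {
    int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % keys.length;
    while (keys[i] != EMPTY && keys[i] != key) {
      i = (i + 1) % keys.length;
    }
    return i;
  }

  public void put(long key, float value) {
    if (key == EMPTY) {
      throw new IllegalArgumentException("reserved key");
    }
    int i = indexOf(key);
    if (keys[i] == EMPTY) {
      if (size + 1 >= keys.length) {
        throw new IllegalStateException("full; this sketch does not resize");
      }
      keys[i] = key;
      size++;
    }
    values[i] = value;
  }

  public float get(long key) {
    int i = indexOf(key);
    return keys[i] == key ? values[i] : Float.NaN;
  }

  public static void main(String[] args) {
    LongFloatMapSketch prefs = new LongFloatMapSketch(1 << 20);
    prefs.put(123456789L, 4.5f);
    System.out.println(prefs.get(123456789L));
    // Each slot here costs 12 bytes (a long plus a float). A HashMap<Long,Float>
    // holding the same preference pays for an entry object plus boxed Long and
    // Float objects -- typically several times that.
  }
}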
On Wed, Feb 10, 2010 at 9:59 PM, Claudio Martella <[email protected]> wrote:
> Hi list,
>
> I'm quite new to Mahout, so I did a brief search on the net looking for
> some benchmarking of the library. I ran into this paper:
> http://iletken-project.com/documents/mahout_review_by_iletken.pdf
> which I'm sure you know about. Could you comment on the issues this paper
> raises? Could you give me some pointers to an answer?
>
> Thanks
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected]
> http://www.tis.bz.it
