I'm copying below my initial comments on the white paper. It highlighted several real gaps in the framework, a number of which have since been addressed: memory usage and performance are broadly 3-4x better, and distributed algorithms are now available.
I'm not sure I found the comparison very valid. It compares distributing a general-purpose, non-distributed, real-time recommender (and some experimental code at that) against a purpose-built, distributed, offline system. The author doesn't seem to have attempted to tune Mahout, or to have asked about it. Then again, they're selling their solution, so who can blame them for picking the use case best suited to it? I'm sure you could turn it around and benchmark the two in a system that needs real-time recommendations, for instance, and find the other system doesn't work at all. But that wouldn't prove much either, so I wouldn't write a paper on it. There's simply no right answer for recommendations; it depends on the data and its scale. An approach that works well in one context might be worthless in another. You'd have to try solutions on your own data and infrastructure to really know what's best.

Deniz and I have already exchanged a few messages. If I might paraphrase the result:

- Some of this I completely agree with. The memory usage is too high; this is what the big MAHOUT-151 and MAHOUT-154 patches are about.

- The big take-away is that the collaborative filtering code is not really delivering on what Mahout advertises: Mahout says it is very distributed-processing-centric, and the collaborative filtering code, at heart, is not by design. So when you ask whether a distributed system scales better than a non-distributed one at a scale where you must distribute, the result is not surprising. The test did attempt to use the one algorithm that is half-distributed.

- The CF code is missing a truly distributable recommendation algorithm, and that is a gap, though it has a partially distributed algorithm and a sort of pseudo-distributed mode for all algorithms.

- I don't believe a distributed computation is appropriate in all, or even most, CF settings. It is too much overhead for a small organization or a small problem, and it does not suit contexts where real-time recommendations are required. But of course there are some situations where a big distributed computation is the only option.

- The CF code is just that: code for general collaborative filtering, and not other things. It does not advertise itself as specialized for a domain or as a tool for related tasks, so I don't think this is a failing of the library.

- This was a test of one algorithm (by necessity, see above), slope-one, based on some junky example code I wrote a long time ago for Netflix (again, by necessity). In slope-one's defense, it is not clear that it is the most appropriate algorithm for the data set tested (Netflix), and my implementation does not include the variant preferred by Daniel Lemire (its creator), called bi-polar slope-one. In any event, this result amounts to an evaluation of one algorithm only, and not really a statement about the framework, which I don't think it was meant to be; the point of the framework is that it provides several algorithms and components for building more. (For anyone unfamiliar with slope-one, a rough sketch of the basic prediction follows below.)

- There is a tradeoff between designing for generality and designing for performance or for a particular domain. If it made more assumptions, the framework could be faster or more accurate for a particular domain. I take this as some indication that the framework remains too far on the side of generality, and could stand to make stronger assumptions and restrictions on its input and problem domain. (See my note on switching to longs only for IDs; a sketch of why that matters for memory also follows below.)
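For anyone unfamiliar with slope-one, here's a rough, illustrative sketch of the basic unweighted prediction in plain Java -- not Mahout's implementation, and not the bi-polar variant, just the core idea with made-up in-memory maps and IDs. You precompute the average rating difference between each pair of co-rated items, then predict a user's rating for an item by adding those differences to the ratings the user already has and averaging:

import java.util.HashMap;
import java.util.Map;

/** Illustrative, unweighted slope-one -- not Mahout's code, just the core idea. */
public class SlopeOneSketch {

  // userID -> (itemID -> rating)
  private final Map<Long, Map<Long, Double>> data;
  // itemJ -> (itemI -> {sum of (ratingJ - ratingI), count of co-raters})
  private final Map<Long, Map<Long, double[]>> diffs = new HashMap<>();

  SlopeOneSketch(Map<Long, Map<Long, Double>> data) {
    this.data = data;
    // Precompute rating differences between every pair of co-rated items.
    for (Map<Long, Double> ratings : data.values()) {
      for (Map.Entry<Long, Double> j : ratings.entrySet()) {
        for (Map.Entry<Long, Double> i : ratings.entrySet()) {
          if (j.getKey().equals(i.getKey())) {
            continue;
          }
          double[] d = diffs.computeIfAbsent(j.getKey(), k -> new HashMap<>())
                            .computeIfAbsent(i.getKey(), k -> new double[2]);
          d[0] += j.getValue() - i.getValue(); // running sum of differences
          d[1]++;                              // number of users rating both
        }
      }
    }
  }

  /** Predict a rating for itemJ by applying the average diffs to the user's ratings. */
  double estimate(long userID, long itemJ) {
    Map<Long, Double> ratings = data.get(userID);
    Map<Long, double[]> diffsForJ = diffs.get(itemJ);
    if (ratings == null || diffsForJ == null) {
      return Double.NaN;
    }
    double total = 0.0;
    int count = 0;
    for (Map.Entry<Long, Double> e : ratings.entrySet()) {
      double[] d = diffsForJ.get(e.getKey());
      if (d == null || d[1] == 0.0) {
        continue;
      }
      total += e.getValue() + d[0] / d[1]; // ratingI + average(ratingJ - ratingI)
      count++;
    }
    return count == 0 ? Double.NaN : total / count;
  }

  public static void main(String[] args) {
    Map<Long, Map<Long, Double>> data = new HashMap<>();
    data.put(1L, Map.of(10L, 5.0, 20L, 3.0, 30L, 2.0));
    data.put(2L, Map.of(10L, 3.0, 20L, 4.0));
    data.put(3L, Map.of(20L, 2.0, 30L, 5.0));
    SlopeOneSketch sketch = new SlopeOneSketch(data);
    System.out.println(sketch.estimate(2L, 30L)); // predicted rating of item 30 for user 2
  }
}

The weighted form scales each term by the number of co-raters, and bi-polar slope-one computes separate deviations over items the user liked and disliked, but the overall structure is the same.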
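On the longs-only-IDs point, the memory argument is roughly this: if IDs are restricted to primitive longs, a preference can live in two flat array slots, while a generic Map<Long,Float> pays for an entry object plus boxed key and value objects per preference. Here's a deliberately minimal sketch of that idea (fixed capacity, no resizing or deletion) -- again not Mahout's code, just an illustration of why the restriction buys so much:

import java.util.Arrays;

/**
 * Illustrative only: a fixed-capacity, linear-probing map from primitive long IDs
 * to float preference values. No resizing, no removal -- the point is the memory
 * layout: two flat arrays instead of an entry object plus boxed key and value per
 * preference.
 */
public final class LongFloatMapSketch {

  private static final long EMPTY = Long.MIN_VALUE; // reserved sentinel for empty slots

  private final long[] keys;
  private final float[] values;
  private int size;

  public LongFloatMapSketch(int capacity) {
    keys = new long[capacity];
    values = new float[capacity];
    Arrays.fill(keys, EMPTY);
  }

  // Hash the long key, then probe linearly until we hit the key or an empty slot.
  private int indexOf(long key) {
    int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % keys.length;
    while (keys[i] != EMPTY && keys[i] != key) {
      i = (i + 1) % keys.length;
    }
    return i;
  }

  public void put(long key, float value) {
    if (key == EMPTY) {
      throw new IllegalArgumentException("reserved key");
    }
    int i = indexOf(key);
    if (keys[i] == EMPTY) {
      if (size + 1 >= keys.length) {
        throw new IllegalStateException("full; this sketch does not resize");
      }
      keys[i] = key;
      size++;
    }
    values[i] = value;
  }

  public float get(long key) {
    int i = indexOf(key);
    return keys[i] == key ? values[i] : Float.NaN;
  }

  public static void main(String[] args) {
    LongFloatMapSketch prefs = new LongFloatMapSketch(1 << 20);
    prefs.put(123456789L, 4.5f);
    System.out.println(prefs.get(123456789L));
    // Each slot here costs 12 bytes (a long plus a float). A HashMap<Long,Float>
    // holding the same preference pays for an entry object plus boxed Long and
    // Float objects -- typically several times that.
  }
}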
On Wed, Feb 10, 2010 at 9:59 PM, Claudio Martella <[email protected]> wrote:
> Hi list,
>
> I'm quite new to Mahout, so I did a brief search on the net looking for
> some benchmarking of the library. I ran into this paper:
> http://iletken-project.com/documents/mahout_review_by_iletken.pdf
> which I'm sure you know about. Could you comment on the issues this paper
> raises? Could you give me some pointers to an answer?
>
> Thanks
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected]
> http://www.tis.bz.it
