I'm a big proponent of Mahout for putting ML into production so please take 
these comments in context. I am hugely grateful that Mahout exists and that you 
guys keep it on the leading edge.

Recently I had the job of trying to explain how to use Mahout to someone taking 
over my code. The questions he asked brought up a lot of issues I'd already 
absorbed and learned to deal with. Here are some of the rough edges that came 
up:

It would be nice if *all* vectors supported attachment of properties: something 
like the named vector, but allowing arbitrary properties (I think there used to 
be a "PropertyVector", but it was removed from some code). This is actually a 
huge issue. Any time you have a multi-job pipeline, you have to introduce 
indexes and reverse indexes, and these can become a scaling problem. They work 
OK if they fit into memory (though they are a pain to code up every time); if 
they don't, the pain becomes major. Many of these indexes would disappear if 
arbitrary keys could be attached to any and all vectors. I understand that this 
doesn't make sense in some cases, for example after a transpose, but there are 
many cases where it does.

Scaling some jobs requires splitting files to get multiple mappers, while in 
other cases files must be combined into one just to run the job. Scaling is one 
huge reason to use Mahout, so it seems like it should be easier, and input 
should universally be a dir of parts *or* a single file. It would be nice to 
have a consistent way to auto-scale, and consistent input and output 
expectations or options (maybe Pig is a good model?).
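
To make the vector-properties request a bit more concrete, here is a rough 
sketch of the kind of thing I mean. It is plain Java, not Mahout code: the 
class name `PropertyVector` and the methods `with`, `property`, and `times` are 
hypothetical names made up for illustration, and a real version would 
presumably wrap Mahout's own vector types and be serializable between jobs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT an existing Mahout class: a dense vector that
// carries arbitrary key/value properties alongside its values.
final class PropertyVector {
    private final double[] values;
    private final Map<String, String> properties = new HashMap<>();

    PropertyVector(double[] values) {
        this.values = values.clone(); // defensive copy of the caller's array
    }

    // Attach a property and return this vector so calls can be chained.
    PropertyVector with(String key, String value) {
        properties.put(key, value);
        return this;
    }

    // Returns the property value, or null if the key was never attached.
    String property(String key) {
        return properties.get(key);
    }

    // Read-only view of all attached properties.
    Map<String, String> properties() {
        return Collections.unmodifiableMap(properties);
    }

    double get(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }

    // An element-wise operation that keeps the properties attached, so a
    // downstream job needs no external index to recover them.
    PropertyVector times(double factor) {
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = values[i] * factor;
        }
        PropertyVector result = new PropertyVector(scaled);
        result.properties.putAll(properties);
        return result;
    }
}
```

The point is only that the key travels with the vector: something like 
`new PropertyVector(v).with("docId", id).times(2.0)` still knows its `docId` 
afterwards, with no index or reverse index to build and keep in memory.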

Maybe it's good enough to keep these things in mind when new code is written. I 
understand that you guys work on Mahout gratis and it's always more fun to add 
new things than clean up. I've also seen committers who responded very quickly 
to requests. It seems a bit petty to mention these things, but they would go a 
fair way towards polishing Mahout. 

Yes, I know I'm welcome to submit patches and will try to do so.
