I'm a big proponent of Mahout for putting ML into production, so please take these comments in context. I am hugely grateful that Mahout exists and that you guys keep it on the leading edge.
Recently I had the job of explaining how to use Mahout to someone taking over my code. His questions brought up a lot of issues I had already absorbed and learned to deal with. Here are some of the rough edges that came up:

It would be nice if *all* vectors supported attachment of properties: something like the named vector, but allowing arbitrary properties (I think there used to be a "PropertyVector", but it was removed from some code). This is actually huge. Any time you have a multi-job pipeline, you have to introduce indexes and reverse indexes, and these can become a scaling problem. They work OK if they fit into memory (though they are a pain to code up every time); if they don't, the pain becomes major. Many of these would disappear if arbitrary keys could be attached to any and all vectors. I understand that this doesn't make sense if a vector is transposed, or in some other cases, but there are many cases where it does.

Scaling some jobs requires splitting files to get multiple mappers, while in other cases files must be combined into one to even run the job. Scaling is one huge reason to use Mahout, so it seems like it should be easier, and input should universally be a dir of parts *or* a single file. It would be nice to have a consistent way to auto-scale, and consistent input and output expectations or options (maybe Pig is a good model?).

Maybe it's good enough to keep these things in mind when new code is written. I understand that you guys work on Mahout gratis and that it's always more fun to add new things than to clean up. I've also seen committers respond very quickly to requests. It seems a bit petty to mention these things, but they would go a fair way towards polishing Mahout. Yes, I know I'm welcome to submit patches, and I will try to do so.
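
P.S. To make the "dir of parts *or* single file" request a bit more concrete, here is a rough sketch of the behavior I have in mind. It is plain Python on a local filesystem, not Mahout or Hadoop code, and `resolve_inputs` is a name I made up for illustration. The only convention borrowed from Hadoop is skipping files whose names start with "_" or ".", which is how its file input formats treat hidden files.

```python
from pathlib import Path


def resolve_inputs(path):
    """Return the data files for `path`.

    Hypothetical helper: a single file is returned as a one-element list,
    a directory is expanded to its visible part files in sorted order.
    """
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        # Skip names starting with "_" or "." (e.g. _SUCCESS, .crc files),
        # mirroring the hidden-file convention of Hadoop input formats.
        return sorted(
            child for child in p.iterdir()
            if child.is_file() and not child.name.startswith(("_", "."))
        )
    raise FileNotFoundError(f"no such input: {path}")
```

With something like this at the front of every job, the caller would pass the same kind of argument whether the previous step wrote one file or many parts.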
