I had a chance to get feedback last night from a few Old Street startups using Mahout. The overall comments were of course positive -- it provides a solution that's at least 80% ready-to-go and saves a great deal of trial and error in getting to something that works.
The problems I heard were similar to last time. The jobs are uneven and not standardized, so each has its own peculiar learning curve. There are evidently still a number of hidden assumptions baked into the code about the file structure and environment too -- I heard again that repeated use of "new Configuration()" around the code breaks things. The experience of Mahout seemed to be one of weeks of trial and error, some of which of course has to do with understanding the machinery of Hadoop. Finally, there was a group that had been using the LDA implementation but abandoned it over scalability concerns -- I didn't get more detail on that.
I do reiterate that there is, at heart, a significant and eager developer audience that is finding all this really useful, but that is burning up a lot of energy just getting started. That's just the nature of this beast at version 0.x, but I think it once again underscores that the need is not for new algorithms, but for cleaning up, fixing, documenting, and streamlining what's already there.
