On Mar 25, 2013, at 4:10 AM, Sebastian Schelter wrote:

> Hi,
>
> throwing in my 2 cents here:
>
> I think that you mentioned a very good point with stating that it is not
> clear whether Mahout is a library, a standalone program to interact with
> via the command line. IMO, it's first and foremost a library (similar to
> Lucene), and this should also be reflected in the codebase.

That is my view as well and I think we have been moderately successful at it.

>
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision. I think we faced and still face
> very hard challenges, as we have to provide answers for the following
> questions:
>
> * for which problems and algorithms does it really make sense to use
> MapReduce?

My test is simply whether someone has implemented it or not. I don't think we have to have a line in the sand. A working, tested, demonstrable implementation beats one that isn't, regardless of which approach it uses, so I don't think we have to decide up front but can instead look at it on a case by case basis. At the end of the day, those who do the work get to decide.

>
> * how broad can the spectrum of things that we offer be without a
> decline in quality?
>
> * how do we deal with the fact that our codebase is split up into a
> collection of algorithms with very few people being able to work on all
> of them, due to the required theoretical background and the complexity
> of efficient code
>
> * how do we provide solutions that allow users to scale very fine
> grained, e.g. from online to precomputed on a single machine to
> precomputed via Hadoop in the recommender stuff.

I don't see these as vision issues; I see them as implementation issues. Regardless, it doesn't matter which category they fall under, as they are the important issues we face. As for the complexity issue, I don't know that we will ever solve it; we just need to identify contributors in those areas quickly, mentor them, and make them committers as soon as they are ready.

>
> I think that Mahout is and should always be more than recommenders, but
> that we should be more courageous in throwing out things that are not
> used very much or not maintained very much or don't meet the quality
> standards which we would like to see.

+1.
I think we have gotten a lot better at this, thanks to Sean, you and others.

>
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.
>

Good docs are always hard. Whatever reduces barriers, the better. Going w/ the GitHub model, there's a lot to be said for Javadocs and/or Markdown right in the code base, but neither solves the developer inertia of actually writing them.

> Best,
> Sebastian
>
> On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
>>
>>
>> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>>> What about an experiment: If you (reading this mail) were to write a two
>>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>>
>>>> Produce open source, scalable machine learning code using a community
>>>> development model.
>>>
>>> So taking that apart:
>>>
>>> - Hadoop is not necessarily part of the equation. All that we promise are
>>> implementations that are reasonably scalable.
>>
>> - We play well with small-ish (fits in memory) and large (fits only in
>> memory of many machines) or huge (fits only on disk) datasets.
>>
>>> - There is no restriction in there wrt. supporting only specific use cases -
>>> in particular no restriction to be recommendations only.
>>>
>>> - There is no restriction to "only batch" or "only online" learning.
>>>
>>> If we want to be that broad we definitely lack lots of people, I think.
>>>
>>> The other question that I cannot answer today: Do we want to be a Java
>>> Library that people link with their project, a standalone program that
>>> people interact with via the command line, a basis that people can easily
>>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>>> workflows or all of these?
>>
>>
>

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com