Re: Call to action – Mahout needs your help

Grant Ingersoll Mon, 25 Mar 2013 08:36:30 -0700

On Mar 25, 2013, at 4:10 AM, Sebastian Schelter wrote:

> Hi,
> 
> throwing in my 2 cents here:
> 
> I think that you mentioned a very good point with stating that it is not
> clear whether Mahout is a library or a standalone program to interact with
> via the command line. IMO, it's first and foremost a library (similar to
> Lucene), and this should also be reflected in the codebase.


That is my view as well and I think we have been moderately successful at it.

> 
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision. I think we faced and still face
> very hard challenges, as we have to provide answers for the following
> questions:
> 
> * for which problems and algorithms does it really make sense to use
> MapReduce?

My test is simply whether someone has implemented it or not.  I don't think we 
have to have a line in the sand.  A working, tested, demonstrable 
implementation beats the one that isn't, regardless of which approach it uses, 
so I don't think we have to decide up front but instead look at it on a case by 
case basis.  At the end of the day, those who do the work get to decide.

> 
> * how broad can the spectrum of things that we offer be without a
> decline in quality?
> 
> * how do we deal with the fact that our codebase is split up into a
> collection of algorithms with very few people being able to work on all
> of them, due to the required theoretical background and the complexity
> of efficient code?
> 
> * how do we provide solutions that allow users to scale very fine
> grained, e.g. from online to precomputed on a single machine to
> precomputed via Hadoop in the recommender stuff?

I don't see these as vision issues, I see them as implementation issues.  
Regardless, it doesn't matter which category they fall under, as they are the 
important issues we face.

As for the complexity issue, I don't know that we ever solve it; we just need 
to identify contributors in those areas quickly, mentor them, and make them 
committers as soon as they are ready.



> 
> I think that Mahout is and should always be more than recommenders, but
> that we should be more courageous in throwing out things that are not
> used very much or not maintained very much or don't meet the quality
> standards which we would like to see.

+1.  I think we have gotten a lot better at this, thanks to Sean, you and 
others.

> 
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.
> 

Good docs are always hard.  Whatever reduces barriers is for the better.  Going w/ 
the Github model, there's a lot to be said for Javadocs and/or Markdown right 
in the code base, but neither solves the developer inertia of actually writing 
them.


> Best,
> Sebastian
> 
> On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
>> 
>> 
>> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>>> What about an experiment: If you (reading this mail) were to write a two
>>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>> 
>>>> Produce open source, scalable machine learning code using a community
>>>> development model.
>>> 
>>> So taking that apart:
>>> 
>>> - Hadoop is not necessarily part of the equation. All that we promise are
>>> implementations that are reasonably scalable.
>> 
>> - We play well with small-ish (fits in memory) and large (fits only in
>> memory of many machines) or huge (fits only on disk) datasets.
>> 
>>> - There is no restriction in there wrt. supporting only specific use cases -
>>> in particular no restriction to be recommendations only.
>>> 
>>> - There is no restriction to "only batch" or "only online" learning.
>>> 
>>> If we want to be that broad we definitely lack lots of people, I think.
>>> 
>>> The other question that I cannot answer today: Do we want to be a Java
>>> Library that people link with their project, a standalone program that
>>> people interact with via the command line, a basis that people can easily
>>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>>> workflows or all of these?
>> 
>> 
> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
