RE: Board Report

Martin, Nick Mon, 07 Apr 2014 13:34:03 -0700

I'll join Chandler w/ the downstream user input fwiw...

Pat's earlier email described our shop perfectly re: recommendations. We're a 
large organization using Mahout's recommendation capability in several projects 
but shy away from the other components. We have a half dozen business units and 
several of them have had fits-and-starts with Mahout for 
clustering/classification/some fpm but collectively we've started to share the 
recommendation capability because it's approachable, has efficient data 
requirements for input and is fairly well documented for our use cases. I think 
the documentation ills have been captured extensively (esp. recently) on @user 
and even here on @dev around some of the other components and I can vouch that 
folks in our organization cite that as a reason they abandoned Mahout.

I share Chandler's desire (and that of others who have offered thoughts in this 
direction in the past week or so on @dev) that whatever the roadmap is, it be 
clear enough that I can plan around it for the next 24-36 months. We have H2O up 
and I confess the potential of migrating our 'data science' activities to a 
singular execution framework/interface/ride-along atop existing Hadoop clusters 
is alluring. We have expansive sprawl wrt stats packages and some diversity in 
ML libs/packages and for an organization our size that's extremely costly. Any 
opportunity we have to consolidate capabilities in this space helps us 
tremendously. Re: Spark we understand the diversification from MR is coming but 
in many important areas of our business we're only now gaining traction with 
leaders to implement MR-based solutions. We're a large ship and turn slowly, so 
all I ask is that there's a long tail for deprecated MR capabilities because 
we'll be slow to convert. 

-----Original Message-----
From: Chandler Burgess [mailto:[email protected]] 
Sent: Monday, April 07, 2014 4:11 PM
To: [email protected]
Subject: RE: Board Report

First, take my opinions with a grain of salt, as I'm sure most will. This is 
basically an anecdote to back up Sean's and Pat's concerns.

I come from an industry (legal) where there is a huge demand for increased 
analytics and machine learning applications. Our stack already includes 
Lucene/Solr; I had heard about Mahout and was curious about applying it to some 
of the things we wanted to do.

I spent around a month playing with Mahout, reading all the documentation and 
articles I could, Mahout In Action, Taming Text, etc. After a month, I came 
away highly disappointed. The documentation in general is very poor, some of 
the drivers are buggy, others unusable because there is basically no 
documentation, examples/potential applications are missing (what the hell can I 
do with Lanczos SVD output? I just want LSI!), and, now, reading more about 
Spark/H2O leaves me uneasy that anything I write and use Mahout for will 
change in the near future, not to mention another platform/technology 
(potentially 2!) I have to learn. 

It seems far, far away from a 1.0 release, which by all public indications is 
next.

It was attractive from a licensing standpoint, and we will probably still use 
it just for seq2sparse. And that will be about it. We're already putting a 
stack together using other libraries which are better documented, from all 
appearances more stable and feature rich, and faster (though maybe not as 
scalable in some cases).

I have deadlines to meet, deliverables to produce, and other projects to work 
on. As it is, I can't trust Mahout and the learning curve is too steep for 
someone like me to apply this in a production environment without being in a 
much bigger company with a lot more resources.

That said, my opinion would be that ONE direction needs to be chosen as the 
main focus and efforts geared toward that. If it's moving to Spark, which 
sounds awesome, then so be it. Otherwise, I fear Mahout will end up a toy for 
hobbyists, people who are already vested in it, or relegated to the trash bin 
while industry moves on to bigger and better things.

-----Original Message-----
From: Pat Ferrel [mailto:[email protected]]
Sent: Monday, April 07, 2014 1:03 PM
To: [email protected]
Subject: Re: Board Report

Mahout needs a reboot. Grant has the right perspective, but I'd take it 
further. His #2 (two efforts) is not and never would be reasonable in anything 
but a huge company. 

I have never and would never take a team the size of Mahout (even with some new 
committers) and split a reboot into two parts on two engines. No sane project 
manager would allow this. Why do we think it will work here?

The recent Gigaom article left me sympathetic with how confused the readers 
must be, let alone potential users or contributors.

Sean is not being nihilistic; two directions will not work for Mahout. Mahout 
has a bad reputation already for being a poorly documented and poorly 
integrated loose collection of code with a lot of technical debt. Honestly, has 
anyone reading this seen increasing interest in the project? A reboot is the 
only thing I can imagine to re-energize it, and even that must be done with the 
utmost in clear communication.

If you accept the above then there seem to be some ways forward:
1) reboot on Spark, let 0xdata do what they will.
2) reboot on 0xdata and let the Spark committers consider becoming MLlib 
committers or other. 
3) fail by issuing confusing direction statements, spending too much time 
supporting and reconciling multiple significantly disparate efforts and 
dividing committers. This is such a classic fail that I have a hard time even 
considering it.

I'd like to see #1 for what it's worth. A concerted effort by all on #1 would 
ensure Mahout is included in future distros. Maybe even #2 would be included 
but #3? It's a non-starter.

On Apr 7, 2014, at 4:53 AM, Grant Ingersoll <[email protected]> wrote:

To Sean's point, if Mahout were "my company", I would do the following 
pragmatic, albeit not so pleasant, thing, assuming, of course, I had the $$$ to 
do so:

1. Clean up existing code with a laser focus on a few key areas (Sebastian's 
list makes sense) using a part of the team and call it 1.0 and ship it, as it 
has a number of users and they deserve to not have the rug pulled out from 
under them.  

2. Spin out a subset of the team to explore and prototype 2.0 based on two very 
positive and re-energizing looking ideas:
        a. Scala DSL (and maybe Spark)
        b. 0xData

        All of the work for #2 would be done in a clean repo and would only 
bring in legacy code where it was truly beneficial (back compat. can come 
later, if at all).
        It would then benchmark those two approaches as well as look at where 
they overlap and are mutually beneficial and then go forward with the winner.

3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal 
support as possible, encouraging, nay -- actively helping -- 1.0 customers 
upgrade as quickly as possible.

The tricky part then becomes how do you make sure to still make your sales #'s 
while also convincing them that your roadmap is what they are really buying.

If I didn't have the $$$ to do both of these (i.e. we need a massive turn 
around and we have one last shot), I would be all in on #2.

-----------------------------------

That being said, Mahout is not "my company".  Heck, Mahout is not even a 
"company", so we don't need to be bound by company conventions and thought 
processes, even if that fits with all of our individual day jobs.  And, 
thankfully, we don't have any sales numbers to make.

We are chartered with one and only one mission: produce open source, scalable 
machine learning libraries under the Apache license and community driven 
principles.  We are not required by the Board or anyone else to support version 
X for Y years or to use Hadoop or Scala or Java.  We are also not required to 
implement any specific algorithms or deliver them on specific time frames.  We 
are also not required to provide users upgrade paths or the like.  Naturally, 
we _want_ to do these things for the sake of the community, but let's be clear: 
it is not a requirement from the ASF.  We are, however, required to have a 
sustaining community. 

------------------------------------

I personally think we should start clean on #2, throwing off the shackles of 
the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, 
not 0.1 as Sebastian suggests, for marketing reasons) built on a completely new 
and fresh repository, likely bringing in only the Math/collections 
underpinnings and maybe the build system.  This new repository would have only 
a handful of core algorithms that we know are well implemented, sustainable and 
best in class.  

I think we should look at the lead up to 0.9 as an experiment that proved out a 
lot of interesting ideas, including the fact that Mahout proved there is vast 
interest in open source large scale machine learning and that it is the 
benchmark for comparison.  Not many other ML projects can say that, even if 
they have better technical implementations or are less fragmented.  Once you 
realize something has outlived its usefulness in software, however, there is 
no point in lingering.

That being said, at least for the foreseeable future, I am not in a position to 
contribute much code.  So, from my perspective, the ASF Meritocratic approach 
takes over:  those who do the work make the decisions.  If you want something 
in, then put up the patch and ask for feedback.  If no one provides feedback, 
assume lazy consensus and move forward.  Nothing convinces people better than 
actual, real, executing code.  For my part, I am happy to continue to work the 
bureaucratic side of things to make sure reports get filed, credentials get 
created, etc., and to contribute the occasional patch.  I hope one day I will 
have time to 
contribute again.

I will follow up w/ a separate email on what I am going to put in the Board 
Report.

On Apr 7, 2014, at 1:52 AM, Sean Owen <[email protected]> wrote:

> No, it's about the opposite. I'm referring to the default, current 
> state of play here.
> 
> The issues for a vendor are demand and supportability. Do people want 
> to pay for support of X? Can you honestly say you have expertise to 
> support and influence X over at least a major release cycle (12-18 
> months)? The latter needs a reasonably reliable roadmap and 
> continuity.
> 
> I'm suggesting that in the current state, demand is low and going 
> down. The current code base seems de facto deprecated/unsupported 
> already, and possibly to be removed or dramatically changed into 
> something as-yet unclear. Nobody here seems to have taken a hard 
> decision regarding a next major release, but, the trajectory of that 
> decision seems clear if the current state remains the same.
> 
> From my perspective, "middle-ground" new directions like adding a bit 
> of H2O, a bit of Spark, leaving bits of M/R code around, etc. are only 
> worse. I can see why there may be a little renewed demand for the new 
> bits, but then, why not go all in on one of them?
> 
> Because a substantially all-new direction is a different story. If a 
> "Mahout2O" or "Spahout" ("Mark"?) emerges as a plan, I could imagine a 
> lot of renewed demand. And a clearer underlying roadmap sounds 
> possible. It would remain to be seen, but there's nothing stopping 
> those ideas from becoming part of a distro too.
> 
> 
> On Mon, Apr 7, 2014 at 6:22 AM, Ted Dunning <[email protected]> wrote:
>> Please be explicit here.  It sounds like you are saying that if 
>> Mahout goes in the proposed new direction that Cloudera will drop Mahout.
>> 
>> Is that what you mean to say?
