Re: Board Report

Sebastian Schelter Mon, 07 Apr 2014 05:07:33 -0700

I think we should mention the redesign/rework of the website and the completion of the move from the old wiki to Apache CMS.


--sebastian


On 04/07/2014 02:04 PM, Grant Ingersoll wrote:

Here is my proposed report.  For the most part, I think the only right thing to 
do vis-a-vis the Board is to report that we are in the midst of a healthy (yes, 
I believe it is, for the most part, healthy and normal) discussion on where to 
go next.

PMC Members: this is checked into SVN at 
https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt.
It is due on Wednesday.  If you object to this approach of reporting, please 
let me know ASAP and suggest alternatives.

=== Apache Mahout Status Report: April 2014 ===

-----

Apache Mahout has implementations of a wide range of machine learning and
data mining algorithms: clustering, classification, collaborative filtering,
and frequent pattern mining.

Project Status
--------------

The project continues to have a large and active user base.  While
the developer base has continued to grow, there is a very active
and healthy debate going on about where Mahout goes next.  Please
see the Issues section below for more details.

Community
---------

* Andrew Musselman was voted in as new committer.
* No changes to the PMC in the reporting period.

* The main issue concerning the community right now is the addition
of new contributions from 0xData and the integration of Mahout with Spark.

Community Objectives
--------------------

Our goal is to build scalable machine learning libraries. See the Issues
section below for the debate in the community about our objectives.


Releases
--------

In addition to an ongoing debate on Mahout's future, the community is actively
working on integrating Mahout with Scala/Spark, updating
documentation, and bringing in new code and committers to update the core 
project.


Issues
------
The Mahout community is at a crossroads in terms of where
to go next.  While the project has a broad number of users and interested
parties, most committers are trying to maintain the code base on a purely
part-time basis, while the amount of work needed to sustain these users clearly
calls for full-time effort.  Furthermore, much of our original code base is written
for Hadoop MapReduce 1.0, which many in the community have come to realize
is not well-suited for solving the kinds of problems that Mahout has set
out to solve.  There have been several lengthy discussions and prototypes
going on to work out next directions along the lines of the Spark and
0xData contributions (there are numerous threads on the [email protected]
mailing list).

The PMC does not think this requires Board intervention at this time
as the debate is, as far as we can tell, healthy.  We do, however,
expect that this debate will take some time to resolve and may mean we
won't be shipping a 1.0 release any time soon.  We will keep the Board
apprised of our next steps as we work through the process.




On Apr 7, 2014, at 4:53 AM, Grant Ingersoll <[email protected]> wrote:

To Sean's point, if Mahout were "my company", I would do the following pragmatic,
albeit not so pleasant, thing, assuming, of course, I had the $$$ to do so:

1. Clean up existing code with a laser focus on a few key areas (Sebastian's
list makes sense) using a part of the team and call it 1.0 and ship it, as it
has a number of users and they deserve to not have the rug pulled out from
under them.

2. Spin out a subset of the team to explore and prototype 2.0 based on two very
positive and re-energizing looking ideas:
a. Scala DSL (and maybe Spark)
b. 0xData

All of the work for #2 would be done in a clean repo and would only
bring in legacy code where it was truly beneficial (back compat. can come
later, if at all).
It would then benchmark those two approaches as well as look at where
they overlap and are mutually beneficial and then go forward with the winner.

3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal
support as possible, encouraging, nay -- actively helping -- 1.0 customers
upgrade as quickly as possible.

The tricky part then becomes how do you make sure to still make your sales #'s
while also convincing them that your roadmap is what they are really buying.

If I didn't have the $$$ to do both of these (i.e. we need a massive turn
around and we have one last shot), I would be all in on #2.

-----------------------------------

That being said, Mahout is not "my company". Heck, Mahout is not even a
"company", so we don't need to be bound by company conventions and thought processes,
even if that fits with all of our individual day jobs. And, thankfully, we don't have any sales
numbers to make.

We are chartered with one and only one mission: produce open source, scalable
machine learning libraries under the Apache license and community driven
principles. We are not required by the Board or anyone else to support version
X for Y years or to use Hadoop or Scala or Java. We are also not required to
implement any specific algorithms or deliver them on specific time frames. We
are also not required to provide users upgrade paths or the like. Naturally,
we _want_ to do these things for the sake of the community, but let's be clear:
it is not a requirement from the ASF. We are, however, required to have a
sustaining community.

------------------------------------

I personally think we should start clean on #2, throwing off the shackles of
the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that,
not 0.1 as Sebastian suggests, for marketing reasons) built on a completely new
and fresh repository, likely bringing in only the Math/collections
underpinnings and maybe the build system. This new repository would have only
a handful of core algorithms that we know are well implemented, sustainable and
best in class.

I think we should look at the lead up to 0.9 as an experiment that proved out a
lot of interesting ideas, including the fact that Mahout proved there is vast
interest in open source large scale machine learning and that it is the
benchmark for comparison. Not many other ML projects can say that, even if
they have better technical implementations or are less fragmented. Once you
realize something has outlived its usefulness in software, however, there is
no point in lingering.

That being said, at least for the foreseeable future, I am not in a position to
contribute much code. So, from my perspective, the ASF Meritocratic approach
takes over: those who do the work make the decisions. If you want something
in, then put up the patch and ask for feedback. If no one provides feedback,
assume lazy consensus and move forward. Nothing convinces people better than
actual, real, executing code. For my part, I am happy to continue to work the
bureaucratic side of things to make sure reports get filed, credentials get
created, etc. and the occasional patch. I hope one day I will have time to
contribute again.

I will follow up w/ a separate email on what I am going to put in the Board
Report.

On Apr 7, 2014, at 1:52 AM, Sean Owen <[email protected]> wrote:

No, it's about the opposite. I'm referring to the default, current
state of play here.

The issues for a vendor are demand and supportability. Do people want
to pay for support of X? Can you honestly say you have expertise to
support and influence X over at least a major release cycle (12-18
months)? The latter needs a reasonably reliable roadmap and
continuity.

I'm suggesting that in the current state, demand is low and going
down. The current code base seems de facto deprecated/unsupported
already, and possibly to be removed or dramatically changed into
something as-yet unclear. Nobody here seems to have taken a hard
decision regarding a next major release, but, the trajectory of that
decision seems clear if the current state remains the same.

From my perspective, "middle-ground" new directions like adding a bit
of H2O, a bit of Spark, leaving bits of M/R code around, etc. are only
worse. I can see why there may be a little renewed demand for the new
bits, but then, why not go all in on one of them?

Because a substantially all-new direction is a different story. If a
"Mahout2O" or "Spahout" ("Mark"?) emerges as a plan, I could imagine a
lot of renewed demand. And a clearer underlying roadmap sounds
possible. It would remain to be seen, but there's nothing stopping
those ideas from becoming part of a distro too.


On Mon, Apr 7, 2014 at 6:22 AM, Ted Dunning <[email protected]> wrote:

Please be explicit here.  It sounds like you are saying that if Mahout goes
in the proposed new direction, Cloudera will drop Mahout.

Is that what you mean to say?


--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
