Right! Even more black box is needed. Mahout is not for scientists, it is for 
app devs, some of whom are trying desperately to learn the math. Not people
coming from R and yearning to do scalable app dev.

I guess I’ve already said that the Mahout reboot should be first on Spark and 
leave the other engines to work out their own integrations. 

On Apr 7, 2014, at 10:45 AM, Sebastian Schelter <s...@apache.org> wrote:

A few questions here.

@Dmitriy I very much share your vision. I just think that our targeted userbase 
is more than people wanting to implement their own algorithms using high level 
constructs in a modern language like Scala. 

I think there is still a huge demand for a blackbox of algorithms that allows 
to easily build a model without having to know too much about the underlying 
math. Our recommenders are a good example. Provide data in a simple CSV format, 
throw ItemSimilarityJob on that and use the item similarities for
recommendations.

Do you suggest we should leave the blackbox stuff to MLBase/Oryx and solely 
focus on providing high level ML constructs?

@Sean How much can you agree on the vision I suggested? It meets your demand of 
having a plan to solve the problems with the MR codebase (by getting rid of it 
in the near future) and provides a direction for Spark as the new underlying 
execution system, with optional support for Stratosphere and H2O, if those
communities manage to convince us that it is worth integrating.

--sebastian




2014-04-07 19:29 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
The document does not mention the state of the existing Spark work in the 
snapshot codebase. Shouldn’t this be noted?

On Apr 7, 2014, at 5:06 AM, Sebastian Schelter <s...@apache.org> wrote:

I think we should mention the redesign/rework of the website and the completion 
of the move from the old wiki to Apache CMS.

--sebastian

On 04/07/2014 02:04 PM, Grant Ingersoll wrote:
> Here is my proposed report.  For the most part, I think the only right thing 
> to do vis-a-vis the Board is to report that we are in the midst of a healthy 
> (yes, I believe it is, for the most part healthy and normal) discussion on 
> where to go next.
>
> PMC Members: this is checked into SVN at 
> https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt.
>   It is due on Wednesday.  If you object to this approach of reporting, 
> please let me know ASAP and suggest alternatives.
>
> === Apache Mahout Status Report: April 2014 ===
>
> -----
>
> Apache Mahout has implementations of a wide range of machine learning and
> data mining algorithms: clustering, classification, collaborative filtering
> and frequent pattern mining.
>
> Project Status
> --------------
>
> The project continues to have a large and active user base.  While
> the developer base has continued to grow, there is a very active
> and healthy debate going on about where Mahout goes next.  Please
> see the Issues section below for more details.
>
> Community
> ---------
>
> * Andrew Musselman was voted in as new committer.
> * No changes to the PMC in the reporting period.
>
> * The main issue concerning the community right now is the addition
> of new contributions from 0xData and the integration of Mahout with Spark.
>
> Community Objectives
> --------------------
>
> Our goal is to build scalable machine learning libraries. See the Issues
> section below for the debate in the community about our objectives.
>
>
> Releases
> --------
>
> In addition to an ongoing debate on Mahout's future, the community is actively
>  working on integrating Mahout with Scala/Spark, updating
> documentation, and bringing in new code and committers to update the core 
> project.
>
>
> Issues
> ------
> The Mahout community is at a crossroads in terms of where
> to go next.  While the project has a broad number of users and interested
> parties, most committers are trying to maintain the code base on a purely
> part-time basis, when the amount of work needed to sustain these users
> clearly calls for full-time effort.  Furthermore, much of our original code base is written
> for Hadoop MapReduce 1.0, which many in the community have come to realize
> is not well-suited for solving the kinds of problems that Mahout has set
> out to solve.  There have been several lengthy discussions and prototypes
> going on to work out next directions along the lines of the Spark and
> 0xData contributions (there are numerous threads on the dev@mahout.a.o
> mailing list.)
>
> The PMC does not think this requires Board intervention at this time
> as the debate is, as far as we can tell, healthy.  We do, however,
> expect that this debate will take some time to resolve and may mean we
> won't be shipping a 1.0 release any time soon.  We will keep the Board
> apprised of our next steps as we work through the process.
>
>
>
>
> On Apr 7, 2014, at 4:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> To Sean's point, if Mahout were "my company", I would do the following, 
>> albeit pragmatic and not so pleasant thing, assuming, of course, I had the 
>> $$$ to do so:
>>
>> 1. Clean up existing code with a laser focus on a few key areas (Sebastian's 
>> list makes sense) using a part of the team and call it 1.0 and ship it, as 
>> it has a number of users and they deserve to not have the rug pulled out 
>> from under them.
>>
>> 2. Spin out a subset of the team to explore and prototype 2.0 based on two 
>> very positive and re-energizing looking ideas:
>>      a. Scala DSL (and maybe Spark)
>>      b. 0xData
>>
>>      All of the work for #2 would be done in a clean repo and would only 
>> bring in legacy code where it was truly beneficial (back compat. can come 
>> later, if at all).
>>      It would then benchmark those two approaches as well as look at where 
>> they overlap and are mutually beneficial and then go forward with the winner.
>>
>> 3. Once #2 is viable, put most effort into it and maintain 1.0 with as 
>> minimal support as possible, encouraging, nay -- actively helping -- 1.0
>> customers upgrade as quickly as possible.
>>
>> The tricky part then becomes how do you make sure to still make your sales 
>> #'s while also convincing them that your roadmap is what they are really 
>> buying.
>>
>> If I didn't have the $$$ to do both of these (i.e. we need a massive turn 
>> around and we have one last shot), I would be all in on #2.
>>
>> -----------------------------------
>>
>> That being said, Mahout is not "my company".  Heck, Mahout is not even a 
>> "company", so we don't need to be bound by company conventions and thought 
>> processes, even if that fits with all of our individual day jobs.  And, 
>> thankfully, we don't have any sales numbers to make.
>>
>> We are chartered with one and only one mission: produce open source, 
>> scalable machine learning libraries under the Apache license and community 
>> driven principles.  We are not required by the Board or anyone else to 
>> support version X for Y years or to use Hadoop or Scala or Java.  We are 
>> also not required to implement any specific algorithms or deliver them on 
>> specific time frames.  We are also not required to provide users upgrade 
>> paths or the like.  Naturally, we _want_ to do these things for the sake of 
>> the community, but let's be clear: it is not a requirement from the ASF.  We 
>> are, however, required to have a sustaining community.
>>
>> ------------------------------------
>>
>> I personally think we should start clean on #2, throwing off the shackles of 
>> the past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, 
>> not 0.1 as Sebastian suggests, for marketing reasons) built on a completely 
>> new and fresh repository, likely bringing in only the Math/collections 
>> underpinnings and maybe the build system.  This new repository would have 
>> only a handful of core algorithms that we know are well implemented, 
>> sustainable and best in class.
>>
>> I think we should look at the lead up to 0.9 as an experiment that proved 
>> out a lot of interesting ideas, including the fact that Mahout proved there 
>> is vast interest in open source large scale machine learning and that it is 
>> the benchmark for comparison.  Not many other ML projects can say that, even 
>> if they have better technical implementations or are less fragmented.  Once 
>> you realize something has outlived its usefulness in software, however,
>> there is no point in lingering.
>>
>> That being said, at least for the foreseeable future, I am not in a position 
>> to contribute much code.  So, from my perspective, the ASF Meritocratic 
>> approach takes over:  those who do the work make the decisions.  If you want 
>> something in, then put up the patch and ask for feedback.  If no one 
>> provides feedback, assume lazy consensus and move forward.  Nothing 
>> convinces people better than actual, real, executing code.  For my part, I 
>> am happy to continue to work the bureaucratic side of things to make sure 
>> reports get filed, credentials get created, etc. and the occasional patch.  
>> I hope one day I will have time to contribute again.
>>
>> I will follow up w/ a separate email on what I am going to put in the Board 
>> Report.
>>
>> On Apr 7, 2014, at 1:52 AM, Sean Owen <sro...@gmail.com> wrote:
>>
>>> No, it's about the opposite. I'm referring to the default, current
>>> state of play here.
>>>
>>> The issues for a vendor are demand and supportability. Do people want
>>> to pay for support of X? Can you honestly say you have expertise to
>>> support and influence X over at least a major release cycle (12-18
>>> months)? The latter needs a reasonably reliable roadmap and
>>> continuity.
>>>
>>> I'm suggesting that in the current state, demand is low and going
>>> down. The current code base seems de facto deprecated/unsupported
>>> already, and possibly to be removed or dramatically changed into
>>> something as-yet unclear. Nobody here seems to have taken a hard
>>> decision regarding a next major release, but, the trajectory of that
>>> decision seems clear if the current state remains the same.
>>>
>>> From my perspective, "middle-ground" new directions like adding a bit
>>> of H2O, a bit of Spark, leaving bits of M/R code around, etc. are only
>>> worse. I can see why there may be a little renewed demand for the new
>>> bits, but then, why not go all in on one of them?
>>>
>>> Because a substantially all-new direction is a different story. If a
>>> "Mahout2O" or "Spahout" ("Mark"?) emerges as a plan, I could imagine a
>>> lot of renewed demand. And a clearer underlying roadmap sounds
>>> possible. It would remain to be seen, but there's nothing stopping
>>> those ideas from becoming part of a distro too.
>>>
>>>
>>> On Mon, Apr 7, 2014 at 6:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>> Please be explicit here.  It sounds like you are saying that if Mahout goes
>>>> in the proposed new direction that Cloudera will drop Mahout.
>>>>
>>>> Is that what you mean to say?
>>
>>
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>



