Ok - that makes sense. Thanks.
On Wed, Apr 16, 2014 at 8:29 AM, Suneel Marthi <[email protected]> wrote:

> The plan is to replace the existing Random Forests impl with a Spark-based
> Streaming Random Forests. As ssc had already mentioned, the plan is not to
> entertain any new MR impls, but to accept bug fixes for existing ones.
>
> The consensus is to do away with the existing MapReduce RF once the
> Spark-based Streaming Random Forests is in place.
>
> On Tue, Apr 15, 2014 at 10:51 PM, Manoj Awasthi <[email protected]> wrote:
>
>> > * remove Random Forest as we cannot even answer questions about the
>> > implementation on the mailing list
>>
>> -1 to removing the present Random Forests. I think it is being used - we
>> (at Adobe) are playing around with it a bit. If the reason for removal is
>> that there is no active maintainer, that can be resolved by people using
>> it getting more active on this - a community action. FWIW, I vote against
>> throwing away this code.
>>
>> On Tue, Apr 15, 2014 at 2:38 PM, Sebastian Schelter <[email protected]> wrote:
>>
>>> On 04/15/2014 11:07 AM, Suneel Marthi wrote:
>>>
>>>> On Tue, Apr 15, 2014 at 12:57 AM, Sebastian Schelter <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> From reading the thread, I have the impression that we agree on the
>>>>> following actions:
>>>>>
>>>>> * reject any future MR algorithm contributions, and prominently state
>>>>>   this on the website and in talks
>>>>> * make all existing algorithm code compatible with Hadoop 2; if there
>>>>>   is no one willing to make an existing algorithm compatible, remove
>>>>>   the algorithm
>>>>> * deprecate Canopy clustering
>>>>> * email the original FPM and Random Forest authors to ask for
>>>>>   maintenance of the algorithms
>>>>> * rename core to "mr-legacy" (and gradually pull items we really need
>>>>>   out of that later)
>>>>>
>>>>> I will create jira tickets for those action points. I think the
>>>>> biggest challenge here is the Hadoop 2 compatibility - is someone
>>>>> volunteering to drive that? Would be awesome.
>>>>>
>>>> With things settling down at work for me, I have time now to dedicate
>>>> back to Mahout. I can drive this effort.
>>>>
>>> That is great news!
>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 04/13/2014 07:19 PM, Andrew Musselman wrote:
>>>>>
>>>>>> This is a good summary of how I feel too.
>>>>>>
>>>>>> On Apr 13, 2014, at 10:15 AM, Sebastian Schelter <[email protected]> wrote:
>>>>>>
>>>>>>> Unfortunately, it's not that easy to get enough voluntary work. I
>>>>>>> issued the third call for working on the documentation today, as
>>>>>>> there are still lots of open issues. That's why I'm trying to
>>>>>>> suggest a move that involves as little work as possible.
>>>>>>>
>>>>>>> We should get the MR codebase into a state that we all can live
>>>>>>> with and then focus on new stuff like the Scala DSL.
>>>>>>>
>>>>>>> --sebastian
>>>>>>>
>>>>>>> On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:
>>>>>>>
>>>>>>>> The best thing would be to make a plan and see how much effort is
>>>>>>>> needed for this, then find volunteers to accomplish the task. I am
>>>>>>>> quite sure there are a lot of people out there who are willing to
>>>>>>>> help out.
>>>>>>>>
>>>>>>>> BR,
>>>>>>>> deneb.
>>>>>>>>
>>>>>>>> 2014-04-13 18:45 GMT+02:00 Sebastian Schelter <[email protected]>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I took some days to let the latest discussion about the state and
>>>>>>>>> future of Mahout go through my head. I think the most important
>>>>>>>>> thing to address right now is the MapReduce "legacy" codebase. A
>>>>>>>>> lot of the MR algorithms are currently unmaintained, the
>>>>>>>>> documentation is outdated, and the original authors have abandoned
>>>>>>>>> Mahout. For some algorithms it is hard to even get questions
>>>>>>>>> answered on the mailing list (e.g. Random Forest). I agree with
>>>>>>>>> Sean's comments that letting the code linger around is no option
>>>>>>>>> and will continue to harm Mahout.
>>>>>>>>>
>>>>>>>>> In the previous discussion, I suggested making a radical move and
>>>>>>>>> aiming to delete this codebase, but there were serious objections
>>>>>>>>> from committers and users that convinced me that there is still
>>>>>>>>> usage of, and interest in, that codebase.
>>>>>>>>>
>>>>>>>>> That puts us into a "legacy dilemma". We cannot delete the code
>>>>>>>>> without harming our userbase. On the other hand, I don't see
>>>>>>>>> anyone willing to rework the codebase. Further, the code cannot
>>>>>>>>> linger around anymore as it is doing now, especially when we fail
>>>>>>>>> to answer questions or don't provide documentation.
>>>>>>>>>
>>>>>>>>> *We have to make a move*!
>>>>>>>>>
>>>>>>>>> I suggest the following actions with regard to the MR codebase. I
>>>>>>>>> hope that they find consent. If there are objections, please give
>>>>>>>>> alternatives; *keeping everything as-is is not an option*:
>>>>>>>>>
>>>>>>>>> * reject any future MR algorithm contributions, and prominently
>>>>>>>>>   state this on the website and in talks
>>>>>>>>> * make all existing algorithm code compatible with Hadoop 2; if
>>>>>>>>>   there is no one willing to make an existing algorithm
>>>>>>>>>   compatible, remove the algorithm
>>>>>>>>> * deprecate the existing MR algorithms, yet still take bug fix
>>>>>>>>>   contributions
>>>>>>>>> * remove Random Forest, as we cannot even answer questions about
>>>>>>>>>   the implementation on the mailing list
>>>>>>>>>
>>>>>>>>> There are two more actions that I would like to see, but I'd be
>>>>>>>>> willing to give up if there are objections:
>>>>>>>>>
>>>>>>>>> * move the MR algorithms into a separate maven module
>>>>>>>>> * remove Frequent Pattern Mining again (we already aimed for that
>>>>>>>>>   in 0.9 but had one user who shouted but never returned to us)
>>>>>>>>>
>>>>>>>>> Let me know what you think.
>>>>>>>>>
>>>>>>>>> --sebastian
