I meant to deprecate first (and eventually remove) Canopy clustering. This is in line with the conversation I had with Ted and Frank at AMS about weaning users away from the old-style Canopy->KMeans clustering and onto Streaming KMeans. There is no point in keeping Canopy once users have switched to Streaming KMeans.

On Sun, Apr 13, 2014 at 1:12 PM, Sebastian Schelter <[email protected]> wrote:

> Do you mean deprecating or removing Canopy clustering? I suggest to
> deprecate all MR code anyways.
>
> --sebastian
>
> On 04/13/2014 07:11 PM, Suneel Marthi wrote:
>
>> If I may add deprecating Canopy clustering to the list once we get
>> Streaming KMeans working right.
>>
>> On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I took some days to let the latest discussion about the state and future
>>> of Mahout go through my head. I think the most important thing to address
>>> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
>>> are currently unmaintained, documentation is outdated and the original
>>> authors have abandoned Mahout. For some algorithms it is hard to get even
>>> questions answered on the mailinglist (e.g. RandomForest). I agree with
>>> Sean's comments that letting the code linger around is no option and will
>>> continue to harm Mahout.
>>>
>>> In the previous discussion, I suggested to make a radical move and aim to
>>> delete this codebase, but there were serious objections from committers
>>> and users that convinced me that there is still usage of and interested
>>> in that codebase.
>>>
>>> That puts us into a "legacy dilemma". We cannot delete the code without
>>> harming our userbase. On the other hand, I don't see anyone willing to
>>> rework the codebase. Further, the code cannot linger around anymore as it
>>> is doing now, especially when we fail to answer questions or don't
>>> provide documentation.
>>>
>>> *We have to make a move*!
>>>
>>> I suggest the following actions with regard to the MR codebase. I hope
>>> that they find consent. If there are objections, please give
>>> alternatives, *keeping everything as-is is not an option*:
>>>
>>> * reject any future MR algorithm contributions, prominently state this
>>>   on the website and in talks
>>
>> +1, this includes the new Frequent Pattern mining impl which is MR
>> based that was provided as a patch few months ago
>>
>>> * make all existing algorithm code compatible with Hadoop 2, if there
>>>   is no one willing to make an existing algorithm compatible, remove
>>>   the algorithm
>>
>> +1. One of the questions I got asked when 0.9 was released was 'when
>> is Mahout gonna be compatible with Yarn and Hadoop 2'? We should target
>> that for the next major//interim release.
>>
>>> * deprecate the existing MR algorithms, yet still take bug fix
>>>   contributions
>>
>> I guess we'll be removing these in some future release, until then we
>> keep absorbing bug fixes ??
>>
>>> * remove Random Forest as we cannot even answer questions to the
>>>   implementation on the mailinglist
>>
>> +1 to removing present Random Forests. Andy Twigg had provided a Spark
>> based Streaming Random Forests impl sometime last year. Its time to
>> restart that conversation and integrate that into the codebase if the
>> contributor is still willing i.e.
>>
>>> There are two more actions that I would like to see, but'd be willing to
>>> give up if there are objections:
>>>
>>> * move the MR algorithms into a separate maven module
>>
>> +1
>>
>>> * remove Frequent Pattern Mining again (we already aimed for that in
>>>   0.9 but had one user who shouted but never returned to us)
>>
>> This thing annoys me the most. We had removed this from 0.9 but yet
>> restored it only because some user wanted it and promised to support it.
>> We have not heard from the user again.
>> Its got old MR code that we don't support anymore and this should be
>> purged ASAP.
>>
>>> Let me know what you think.
>>>
>>> --sebastian
