If I may, I'd add deprecating Canopy clustering to the list once we get
Streaming KMeans working right.
On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter <[email protected]> wrote:
Hi,
I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailing list (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is not an option and
will continue to harm Mahout.
In the previous discussion, I suggested making a radical move and aiming to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.
That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.
*We have to make a move*!
I suggest the following actions with regard to the MR codebase. I hope
that they find consensus. If there are objections, please give alternatives;
*keeping everything as-is is not an option*:
* reject any future MR algorithm contributions, prominently state this on
the website and in talks
+1, this includes the new MR-based Frequent Pattern mining impl that was
provided as a patch a few months ago
* make all existing algorithm code compatible with Hadoop 2; if no one is
willing to make an existing algorithm compatible, remove the
algorithm
+1. One of the questions I got asked when 0.9 was released was 'when
is Mahout gonna be compatible with YARN and Hadoop 2?' We should target
that for the next major/interim release.
* deprecate the existing MR algorithms, yet still take bug fix
contributions
I guess we'll be removing these in some future release, and until then we
keep absorbing bug fixes?
* remove Random Forest as we cannot even answer questions about the
implementation on the mailing list
+1 to removing the present Random Forests. Andy Twigg provided a
Spark-based Streaming Random Forests impl sometime last year. It's time to
restart that conversation and integrate that into the codebase, if the
contributor is still willing, that is.
There are two more actions that I would like to see, but I'd be willing to
give up if there are objections:
* move the MR algorithms into a separate Maven module
+1
* remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)
This thing annoys me the most. We had removed this from 0.9 but
restored it only because some user wanted it and promised to support it. We
have not heard from the user again.
It's got old MR code that we don't support anymore and this should be
purged ASAP.
Let me know what you think.
--sebastian