On Wed, Jul 4, 2012 at 9:25 PM, Isabel Drost <[email protected]> wrote:

> b) When telling the story of Mahout I am faced with comments along the lines of
> "I don't want to install Hadoop to run it" over and over again. It seems
> non-obvious that Mahout actually aims to be a stable, scalable Java library
> that comes with the added benefit of having quite a few of its algorithms
> implemented on Hadoop.

Personally, I think Mahout equals Hadoop-based. I don't think it's
useful to pretend otherwise, and I see no issue with this identity. It
does not have consistent implementations based on anything else, and
implementations based on a very different framework sound like a
different project in their own right.


> - mark all long running tests with the nightly annotation - my goal here is not
> to switch them off forever but rather draw contributors' attention to those
> running particularly long (>20s) and fix them

There's a scene in a movie I love called Kicking and Screaming (not
the Will Ferrell one), where someone drops a glass on the kitchen
floor. The housemates are quintessentially lazy, and so it's left for
the next day. But the next day one of them takes action -- he places a
carefully lettered sign reading "Watch Out" over the broken glass and
moves on.

Annotations with this purpose remind me of this scene.

These aren't going to get fixed; these half-hour tests have been in
for a year with notes to fix them. They're effectively broken tests;
fix them by removing them. That's worse than actually fixing them, but
better than people not running the tests at all because the suite
takes an hour to finish.
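For concreteness, the "nightly annotation" idea being discussed amounts to something like the sketch below. The annotation and class names here are made up for illustration -- Mahout's actual annotation may be named differently, and a real setup would wire the marker into the test runner (e.g. via test groups) rather than use plain reflection:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class NightlyDemo {

  // Hypothetical marker annotation for slow tests; the real name may differ.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface Nightly {}

  // Example test class: one slow test flagged for nightly runs, one fast test.
  public static class SomeTests {
    @Nightly
    public void testHugeClustering() { /* pretend this takes >20s */ }

    public void testSmallVectorOps() { /* fast enough for every build */ }
  }

  // Count how many test methods a runner would defer to the nightly build.
  public static int countNightly(Class<?> testClass) {
    int n = 0;
    for (Method m : testClass.getDeclaredMethods()) {
      if (m.isAnnotationPresent(Nightly.class)) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    System.out.println("nightly-only tests: " + countNightly(SomeTests.class));
    // prints "nightly-only tests: 1"
  }
}
```

The marker costs nothing at runtime; the whole question is whether anyone ever comes back for the flagged tests.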


> - convert our current core module into a parent, move any code in it into a
> submodule called stuff
>
> - move anything out of stuff and into a module called write that concerns
> serialization and is reasonably algorithm independent
>
> - move anything out of stuff and into a module called hadoop that really needs
> mapreduce to run
>
> - move anything out of stuff and into cli that offers just a command line
> interface to implementations (I might have missed some jobs here that still
> contain logic in addition to the command line handling; all I did was go
> through and fix failing tests. For several jobs I factored the parameters into
> separate beans to deal with default values; I suppose some of Frank's work
> could come in handy when doing that right.)
>
> - factor some of the unit testing utils into their own modules (those two
> could actually be collapsed) to avoid depending on running the tests just for
> compiling all the source code.
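For reference, the proposed split would turn core into a parent POM roughly like the following. This is a sketch only -- the coordinates and version are illustrative, and the module names are taken from the proposal above:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.8-SNAPSHOT</version>
  <packaging>pom</packaging>
  <!-- Hypothetical layout; each module becomes its own artifact. -->
  <modules>
    <module>stuff</module>   <!-- leftover code pending a better home -->
    <module>write</module>   <!-- algorithm-independent serialization -->
    <module>hadoop</module>  <!-- anything that truly needs MapReduce -->
    <module>cli</module>     <!-- command line entry points only -->
  </modules>
</project>
```

Each submodule would ship as a separate JAR, which is exactly where the artifact-count concern below comes in.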

I think I'm against this -- it's complex, and I think it supposes that
the code is meaningfully separable along these lines. I don't think it
is: everything is fairly tied in with Hadoop and its serialization.
Weigh that against the complication of making users -- who are going
to want all the Hadoop stuff anyway -- import 4-5 JARs / artifacts
instead of 2-3, which already feels like 1-2 too many.

There is a lot of messy stuff that needs to be refactored, and maybe
these ideas address some of it, but none of the problems I can see
are at the module level. It's dirtier and harder than that.
