On Wed, Jul 4, 2012 at 9:25 PM, Isabel Drost <[email protected]> wrote:
> b) When telling the story of Mahout I am faced with comments along the lines of
> "I don't want to install Hadoop to run it" over and over again. It seems
> non-obvious that Mahout actually aims to be a stable, scalable Java library
> that comes with the added benefit of having quite a few of its algorithms
> implemented on Hadoop.

Personally, I think that Mahout equals Hadoop-based. I don't think it's useful to pretend otherwise, and I don't see an issue with this identity. The project has no consistent implementations based on anything else; implementations based on a very different framework sound like a different project in themselves.

> - mark all long running tests with the nightly annotation - my goal here is
> not to switch them off forever but rather draw contributors' attention to
> those running particularly long (>20s) and fix them

There's a scene in a movie I love called Kicking and Screaming (not the Will Ferrell one) where someone drops a glass on the kitchen floor. The housemates are quintessentially lazy, so it's left until the next day. One of them does take action the next morning -- he places a carefully lettered sign reading "Watch Out" over the broken glass and moves on. Annotations used this way remind me of that scene. These tests aren't going to get fixed; these half-hour tests have been in for a year with notes to fix them. They're essentially broken tests, so fix them by removing them. That's less good than actually fixing them, but better than people not running the test suite at all because it takes an hour to finish.
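For what it's worth, the mechanism being proposed is just a marker annotation plus a filter at run time; in a real build this would be a JUnit category wired into a Maven profile, but a stdlib-only sketch of the idea (all names here are hypothetical, not Mahout's actual annotation) looks like this:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical marker, analogous to the proposed "nightly" annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Nightly {}

public class NightlyDemo {

  // A fake test class: one fast test, one long-running test marked @Nightly.
  static class SomeTests {
    public void testFast() {}
    @Nightly public void testHugeMatrixFactorization() {}
  }

  // Select test method names, skipping @Nightly ones unless asked for them.
  static List<String> selectTests(Class<?> suite, boolean includeNightly) {
    List<String> selected = new ArrayList<>();
    for (Method m : suite.getDeclaredMethods()) {
      if (!m.getName().startsWith("test")) continue;
      boolean nightly = m.isAnnotationPresent(Nightly.class);
      if (includeNightly || !nightly) selected.add(m.getName());
    }
    Collections.sort(selected);
    return selected;
  }

  public static void main(String[] args) {
    System.out.println(selectTests(SomeTests.class, false)); // [testFast]
    System.out.println(selectTests(SomeTests.class, true));  // [testFast, testHugeMatrixFactorization]
  }
}
```

The point of the "Watch Out" analogy is that the filter makes the default run skip the marked tests entirely, so nothing pushes anyone to actually fix them.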
> - convert our current core module into a parent, move any code in that into a
> submodule called stuff
>
> - move anything out of stuff and into module write that concerns
> serialization and is reasonably algorithm independent
>
> - move anything out of stuff and into module hadoop that really needs
> mapreduce to run
>
> - move anything out of stuff and into cli that offers just a command line
> interface to implementations (I might have missed some jobs here that still
> contain logic in addition to the command line stuff, all I did was to go
> through and fix failing tests, for several jobs I factored the parameters
> into separate beans to deal with default values, I suppose some of Frank's
> work could come in handy when doing that right.)
>
> - factor some of the unit testing utils into their own modules (those two
> could be collapsed actually) to avoid depending on running the tests just
> for compiling all the source code.

I think I'm against this -- it's complex, and it supposes that the code is meaningfully separable along these lines. I don't think it is: everything is fairly tied in with Hadoop and its serialization. Weigh that against the complication of making users -- who are going to want all the Hadoop stuff anyway -- import 4-5 JARs / artifacts instead of 2-3, which already feels like 1-2 too many. There is a lot of messy code that needs to be refactored, and maybe these ideas cover some of it, but none of the problems I can see live at the module level. It's dirtier and harder than that.
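(For concreteness, the split being debated would amount to a parent pom along these lines -- a hypothetical sketch built only from the module names in the quoted proposal, not a worked-out build file:)

```xml
<!-- Hypothetical parent pom for the proposed split; module names from the mail. -->
<project>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <packaging>pom</packaging>
  <modules>
    <module>stuff</module>   <!-- code not yet sorted into a home -->
    <module>write</module>   <!-- algorithm-independent serialization -->
    <module>hadoop</module>  <!-- code that really needs mapreduce -->
    <module>cli</module>     <!-- command-line front ends only -->
  </modules>
</project>
```

Each `<module>` becomes a separately published artifact, which is where the 4-5 JARs per user come from.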
