Re: Thinking about Mahout layout, builds, etc.

Steve Rowe Wed, 30 Jan 2008 05:57:03 -0800

I have experience with Ant, Maven 1.X, and Maven 2.

Maven 1.X builds are a nightmare to maintain - we should not go there. Fortunately for Maven 2, it's a completely different animal.

When your use case is directly supported by Maven 2, it's a beautiful thing. As Grant says, it's magic. Like Grant, I've written M2 plugins and set up some complex builds.

But unless you pin down the versions of plugins you use (currently possible using <dependencyManagement>) and the exact version of M2 you use (I don't think this is possible yet), people can get different results. Maven has never been terribly stable, because it's in a constant state of change. Maven 2.1 is the current focus of development, so modifications to 2.0.X tend to take a long time to be released. This has long been the Maven way: focusing on future (backwards-incompatible) versions to the detriment of the existing versions.

If we don't use Maven, then we need to have alternative dependency management and site building facilities, since, unlike Maven, Ant does not provide support for these. Maybe at the start we can ignore these two - without code, there isn't much of a site required, and the dependencies will be fairly static (per algorithm).

I have heard good things about Ivy for dependency management, though I've never used it. I think it leverages Maven remote repositories. And Lucene uses Forrest to build its site. Both of these things can be bolted on later if we start with an Ant build.

I've changed my mind about the project structure: I think it's okay to start out with a single source tree. If it makes sense to do so later, splitting algorithms out shouldn't be too hard.

Similarly, I think shipping a monolithic jar is okay to begin with. Size is definitely not an issue, in the short- and medium-term, anyway.


Summarizing my votes:

Build system:
        +0 Maven 2
        +1 Ant

Project structure:
        +0 Per-algorithm source tree
        +1 Single source tree

Release artifact(s):
        +0 Per-algorithm jar
        +1 Monolithic jar

Steve

Grant Ingersoll wrote:

A couple of comments on various things that have come up (btw, I love the participation, already!)
1. The structure fits well with Maven or ANT. Personally, I have come full circle from ANT - Maven - ANT. I have done a lot of ANT building and a lot of Maven building, including writing plugins/tasks, etc. ANT is less magic at the cost of a little more upfront work (but it is easy to set up common build functionality, etc.). Magic in your builds is not good. Maven updates itself automatically, gets jars automatically, etc. I know this sounds like a good thing, but it isn't, IMO. Especially when it comes to the plugins. You have no idea whether everyone is building on the same base. Maven does not do much to guarantee back-compatibility, either. On the other hand, the Maven repository is really nice. And I really like that Maven has convinced people that using common file structures and conventions is a good thing in project management. But neither of these things requires Maven itself. I tend to want to minimize our 3rd party dependencies, anyway, as much as possible. The simpler we can keep this, the better off we will be.
2. One other good thing from an infrastructure point of view for the sub-project structure is we can, in theory, give permission to a committer on a single algorithm, much like the contrib modules in Lucene. This isn't a big deal, but it could be useful, if someone is really knowledgeable in one particular area and is only contributing in that area. Generally, however, I would favor making someone a full committer.
3. I do like the idea of both separate jars and a single uber-jar. This is trivial to do in both ANT and Maven.
-Grant

On Jan 30, 2008, at 3:21 AM, Ted Dunning wrote:
And all of Colt is < 1M.
I would say that it isn't all that likely that the library will get to more
than a few megs (if that).  At that size, it really doesn't matter that
there is a bit of dross along for the ride.

How many here would rather pick and choose pieces out of rapid miner or
weka? Or would you rather just download the comprehensive jar and be ready
to roll?
I also think that the example of text translation vs spam categorization is a bit of a straw man. It is much more likely that these would be entirely independent applications that would themselves like to download the (single)
Mahout jar.
On 1/29/08 11:45 PM, "Isabel Drost" <[EMAIL PROTECTED]> wrote:
On Wednesday 30 January 2008, Steve Rowe wrote:
On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
I would prefer to have an option not to work with whole library but
select only specific algorithms and optionally their particular
modifications.
+1
+1 I would at least like to have one downloadable jar for each algorithm
family (why would I as a user want to download the functionality for
translating texts, if all I want to do is build a better spam classification plugin for spam assassin?) plus one library for the common code like input-/output-filters.

Maybe we should look at other machine learning frameworks that followed
the "all in one jar" path to get a feeling on how large a project can easily get. Please be careful with these numbers, as both projects are trying to
provide whole machine learning frameworks with GUIs for experimentation,
algorithms for evaluation and the like.

Weka                          Compiled: 4.4M
Rapid Miner   Sources: 12M    Compiled: 4.5M (21M including all dependencies)
Isabel
