I have experience with Ant, Maven 1.X, and Maven 2.

Maven 1.X builds are a nightmare to maintain - we should not go there. Fortunately, Maven 2 is a completely different animal.

When your use case is directly supported by Maven 2, it's a beautiful thing. As Grant says, it's magic. Like Grant, I've written M2 plugins and set up some complex builds.

But unless you pin down the versions of the plugins you use (currently possible using <pluginManagement>) and the exact version of M2 you use (I don't think this is possible yet), people can get different results. Maven has never been terribly stable, because it's in a constant state of change. Maven 2.1 is the current focus of development, so modifications to 2.0.X tend to take a long time to be released. This has long been the Maven way: focusing on future (backwards-incompatible) versions to the detriment of the existing versions.
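
For illustration, here is a minimal sketch of what pinning plugin versions in a pom.xml can look like. It uses the <pluginManagement> section, which governs plugin versions the way <dependencyManagement> governs library versions. The project coordinates and the plugin version numbers below are placeholders I made up for the example, not a recommendation:

```xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <!-- Hypothetical coordinates, for illustration only -->
  <groupId>org.example</groupId>
  <artifactId>example-project</artifactId>
  <version>0.1-SNAPSHOT</version>

  <build>
    <pluginManagement>
      <plugins>
        <!-- Each plugin gets an explicit version so that every developer
             resolves the same one instead of whatever is latest.
             The version numbers are illustrative. -->
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>2.0.2</version>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <version>2.1</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
```

This only fixes the plugins that are listed; any plugin left out is still resolved without a pinned version.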

If we don't use Maven, then we need to have alternative dependency management and site building facilities, since, unlike Maven, Ant does not provide support for these. Maybe at the start we can ignore these two - without code, there isn't much of a site required, and the dependencies will be fairly static (per algorithm).

I have heard good things about Ivy for dependency management, though I've never used it. I think it leverages Maven remote repositories. And Lucene uses Forrest to build its site. Both of these things can be bolted on later if we start with an Ant build.

I've changed my mind about the project structure: I think it's okay to start out with a single source tree. If it makes sense to do so later, splitting algorithms out shouldn't be too hard.

Similarly, I think shipping a monolithic jar is okay to begin with. Size is definitely not an issue in the short and medium term, anyway.

Summarizing my votes:

Build system:
        +0 Maven 2
        +1 Ant

Project structure:
        +0 Per-algorithm source tree
        +1 Single source tree

Release artifact(s):
        +0 Per-algorithm jar
        +1 Monolithic jar

Steve

Grant Ingersoll wrote:
A couple of comments on various things that have come up (btw, I love the participation, already!)

1. The structure fits well with Maven or ANT. Personally, I have come full circle from ANT - Maven - ANT. I have done a lot of ANT building and a lot of Maven building, including writing plugins/tasks, etc. ANT is less magic at the cost of a little more upfront work (but it is easy to set up common build functionality, etc.). Magic in your builds is not good. Maven updates itself automatically, gets jars automatically, etc. I know this sounds like a good thing, but it isn't, IMO. Especially when it comes to the plugins. You have no idea whether everyone is building on the same base. Maven does not do much to guarantee back-compatibility, either. On the other hand, the Maven repository is really nice. And I really like that Maven has convinced people that using common file structures and conventions is a good thing in project management. But neither of these things requires Maven itself. I tend to want to minimize our 3rd party dependencies, anyway, as much as possible. The simpler we can keep this, the better off we will be.

2. One other good thing from an infrastructure point of view for the sub-project structure is we can, in theory, give permission to a committer on a single algorithm, much like the contrib modules in Lucene. This isn't a big deal, but it could be useful, if someone is really knowledgeable in one particular area and is only contributing in that area. Generally, however, I would favor making someone a full committer.

3. I do like the idea of both separate jars and a single uber-jar. This is trivial to do in both ANT and Maven.
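
As a rough sketch of the ANT side of this, the build file below produces one jar per algorithm family plus a single uber-jar from the same compiled classes. The project name, directory layout, and package names (org/example/clustering, org/example/classification, org/example/common) are hypothetical and only there to make the example complete:

```xml
<project name="example" default="jars" basedir=".">
  <!-- Hypothetical layout: sources in src/, classes in build/classes -->
  <property name="src.dir" value="src"/>
  <property name="classes.dir" value="build/classes"/>
  <property name="dist.dir" value="dist"/>

  <target name="compile">
    <mkdir dir="${classes.dir}"/>
    <javac srcdir="${src.dir}" destdir="${classes.dir}"/>
  </target>

  <target name="jars" depends="compile">
    <mkdir dir="${dist.dir}"/>

    <!-- One jar per algorithm family, selected by package path -->
    <jar destfile="${dist.dir}/example-clustering.jar"
         basedir="${classes.dir}"
         includes="org/example/clustering/**"/>
    <jar destfile="${dist.dir}/example-classification.jar"
         basedir="${classes.dir}"
         includes="org/example/classification/**"/>

    <!-- Shared code that the per-algorithm jars depend on -->
    <jar destfile="${dist.dir}/example-common.jar"
         basedir="${classes.dir}"
         includes="org/example/common/**"/>

    <!-- Uber-jar: everything that was compiled -->
    <jar destfile="${dist.dir}/example-all.jar"
         basedir="${classes.dir}"/>
  </target>
</project>
```

Running the default target with a plain `ant` would then leave all four jars in dist/. Selecting by package path works with a single source tree, so this does not depend on splitting the sources per algorithm.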

-Grant

On Jan 30, 2008, at 3:21 AM, Ted Dunning wrote:


And all of Colt is < 1M.

I would say that it isn't all that likely that the library will get to more
than a few megs (if that).  At that size, it really doesn't matter that
there is a bit of dross along for the ride.

How many here would rather pick and choose pieces out of Rapid Miner or
Weka? Or would you rather just download the comprehensive jar and be ready
to roll?

I also think that the example of text translation vs spam categorization is
a bit of a straw man. It is much more likely that these would be entirely
independent applications that would themselves like to download the (single)
Mahout jar.

On 1/29/08 11:45 PM, "Isabel Drost" <[EMAIL PROTECTED]> wrote:

On Wednesday 30 January 2008, Steve Rowe wrote:
On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
I would prefer to have an option not to work with whole library but
select only specific algorithms and optionally their particular
modifications.

+1

+1 I would at least like to have one downloadable jar for each algorithm
family (why would I as a user want to download the functionality for
translating texts, if all I want to do is build a better spam classification
plugin for SpamAssassin?) plus one library for the common code like
input/output filters.

Maybe we should look at other machine learning frameworks that followed
the "all in one jar" path to get a feeling for how large a project can easily
get. Please be careful with these numbers, as both projects are trying to
provide whole machine learning frameworks with GUIs for experimentation,
algorithms for evaluation and the like.

Project        Sources   Compiled
Weka           -         4.4M
Rapid Miner    12M       4.5M (21M including all dependencies)

Isabel
