Hi Guys,

My 2c:

Grant Ingersoll wrote:
But would people prefer getting the jars separately? I do think there is some common housekeeping code, but I also don't want to overemphasize it when it comes to developing an individual algorithm. In other words, if a Bayes classifier and an SVM implementation could share a common framework but it would end up being really confusing, versus each being more or less cleanly separated and logical, I would favor the separation. By the same token, if they can work beautifully together, then that would argue for more common code.

Are we planning on making separate releases for the code?
Does having these bundled together somehow impact the performance or functionality of other algorithms?
Would the combined size of the jar exceed 2-3 MB?

If the answer to all of these is no, we should just have a single jar. Size should not be an issue here; development/operational speed should be. It is much easier to manage a single jar operationally, IMHO.


As for Hadoop and HBase, those are just two potential libraries; we are potentially talking 10+. Would you want to download a huge jar that contains everything when all you want is a single algorithm? Granted, that can be done from one source tree, but I wonder if that makes it harder.
Yep, but in these days of Maven and the like I have no idea how many jars I'm actually downloading; it just does it.


But I do take away that we probably should just start simple and not worry about a complex build just yet. I think it is safe to say that up through our first official release we can feel free to change things around if we have to.


Yep.

-Grant


On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:

+1

Not going to repeat the same arguments, but one other thing is that almost all of the algorithms are going to (or at least should) share some common housekeeping code, the main chunk of which will probably be IO. Functionally, I don't think an individual algorithm is significant enough to warrant its own project, and many of them might wind up sharing common interfaces.
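For illustration, a minimal sketch of what such a shared interface might look like (names are hypothetical, not an agreed design):

  // Hypothetical contract in a common package; Bayes, SVM, etc. could
  // each implement it while keeping their internals cleanly separated.
  package org.apache.mahout.common;

  import java.util.List;

  public interface Classifier {
    /** Train on feature vectors paired with labels. */
    void train(List<double[]> features, List<String> labels);

    /** Predict a label for one feature vector. */
    String classify(double[] features);
  }

The shared IO housekeeping could live next to it in the same common package.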

~ Mason

Jeff Eastman wrote:
+1
A single project facilitates refactoring and promotes consistency of design. If there's not enough code in Hadoop+HBase to justify multiple projects, it would be premature abstraction to organize Mahout that way. Let's keep it simple...
Jeff
-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008 4:21 PM
To: [email protected]
Subject: Re: Thinking about Mahout layout, builds, etc.
Initially, developers will be hitting bugs or bad design all over the place, so they would favor one project.  Also, with good package design, you get most of the benefits of multiple projects.
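Concretely, "good package design" here might look like the following hypothetical skeleton, where an algorithm package depends only on common and never on a sibling algorithm:

  package org.apache.mahout.bayes;

  import java.util.List;
  import org.apache.mahout.common.Classifier;  // shared code: fine
  // org.apache.mahout.svm is never imported: siblings stay independent

  public class BayesClassifier implements Classifier {
    public void train(List<double[]> features, List<String> labels) {
      // model building would go here
    }

    public String classify(double[] features) {
      return "unknown";  // placeholder
    }
  }

With that discipline, splitting into separate projects later is mostly a matter of moving directories.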
So why not start simple and migrate to complicated later?
On 1/29/08 3:15 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
Thinking about these alternatives from an Eclipse user's point of view, the original proposal would seem to encourage multiple projects (one per algorithm + a common project) while the second would encourage a single project containing multiple packages. Depending upon the amount of code that would reside in each algorithm, one or the other might be preferable.

Would a given developer typically be working on the entire library (favoring a single project) or just on one or two algorithms (favoring multiple projects)?

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008 2:43 PM
To: [email protected]
Subject: Re: Thinking about Mahout layout, builds, etc.



I think that having multiple source roots is a pain.  That is what packages are for.

I would recommend instead:

- at the top level, there should be trunk, tags, and releases, as is typical in an SVN-based project.

- below trunk and any tag or release there should be:

  docs
  lib
  src/org/apache/mahout

Below the source directory, there should be packages common, algorithmA, algorithmB, and all tests should be located near the associated source. If it is really desirable to separate tests from normal source (I have done it both ways and find having the tests nearby beneficial), then there can be a parallel tree next to src called "test".
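For example (hypothetical names, assuming JUnit), "tests near the associated source" could be as simple as:

  src/org/apache/mahout/algorithmA/AlgorithmA.java
  src/org/apache/mahout/algorithmA/AlgorithmATest.java

  // AlgorithmATest.java -- a minimal JUnit-style test beside its source
  package org.apache.mahout.algorithmA;

  import junit.framework.TestCase;

  public class AlgorithmATest extends TestCase {
    public void testSmoke() {
      assertTrue(true);  // placeholder; real assertions would go here
    }
  }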

The target of compilation should be a single jar file.


On 1/29/08 2:26 PM, "Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

I am thinking a structure like the following would be useful for getting started:

mahout/trunk/
  docs/
  common/
    src/
      main/
      test/
    docs/
    lib/
  algorithmA/
    (similar to common, but for this algorithm)
  algB/
    ...
  ...

Where algorithmA, B, etc. are the various libraries we intend to implement.  We can hold off on creating them until we have some code, but I was thinking it would be good to have the general layout in mind.

Of course, this is expandable and changeable.  What do others think?

On a related note, one of the things we discussed pre-Apache was the general sense that we shouldn't feel the need to create an all-encompassing framework. The basic gist of this is that any given library could be completely independent of the others (with maybe the exception that they share a common library). My gut says this is the way to get started, but that it may evolve over time once we have some running time together and can start to recognize synergies, such that maybe by the time we get to 1.0 of Mahout there may be more common code than we originally thought. The "common" area above can serve as the area for utilities, classes, common Hadoop extensions, etc. that are shared between the various algorithms, but I would also say let's not try to prematurely optimize across the algorithms just yet.

Anyone else have any preference on this?

-Grant


--
Mason Tang '10, Course 6-3
Address: Burton-Conner 224A        Email: [EMAIL PROTECTED]
        410 Memorial Dr.          Phone: 508-414-5811
        Cambridge, MA 02139         WWW: www.geekbyday.com



