Hi Guys,

My 2c:

Grant Ingersoll wrote:
But would people prefer getting the jars separately? I do think there is some common housekeeping code, but I also don't want to overemphasize it when it comes to developing an individual algorithm. In other words, if a Bayes classifier and an SVM implementation could share a common framework but it would end up being really confusing, versus each being more or less cleanly separated and logical, I would favor the separation. By the same token, if they can work beautifully together, then that would argue for more common code.

Are we planning on making separate releases for the code?
Does having these bundled together somehow impact the performance or functionality of other algorithms?
Would the combined size of the jar exceed 2-3 MB?

If the answer to all of these is no, we should just have a single jar. Size should not be an issue here; development/operational speed should be. It is much easier to manage a single jar operationally, IMHO.


As for Hadoop and HBase, those are just two potential libraries; we are potentially talking 10+. Would you want to download a huge jar that contains everything when all you want is a single algorithm? Granted, that can be done from one source tree, but I wonder if that makes it harder.
Yep, but in these days of Maven and the like I have no idea how many jars I'm actually downloading; it just does it.


But I do take away that we probably should just start simple and not worry about a complex build just yet. I think it is safe to say that up through our first official release we can feel free to change things around if we have to.


Yep.

-Grant


On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:

+1

Not going to repeat the same arguments, but one other thing is that almost all of the algorithms are going to (or at least should) share some common housekeeping code, the main chunk of which will probably be IO. Functionally, I don't think an individual algorithm is significant enough to warrant its own project, and many of them might wind up sharing common interfaces.
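For illustration, a minimal sketch of what such a shared interface might look like (names are hypothetical, not an agreed design):

  // Hypothetical contract in a common package; Bayes, SVM, etc. could
  // each implement it while keeping their internals cleanly separated.
  package org.apache.mahout.common;

  import java.util.List;

  public interface Classifier {
    /** Train on feature vectors paired with labels. */
    void train(List<double[]> features, List<String> labels);

    /** Predict a label for one feature vector. */
    String classify(double[] features);
  }

The shared IO housekeeping could live next to it in the same common package.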

~ Mason

Jeff Eastman wrote:
+1
A single project facilitates refactoring and promotes consistency of design. If there's not enough code in Hadoop+HBase to justify multiple projects, it would be premature abstraction to organize Mahout that way. Let's keep it simple...
Jeff
-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008 4:21 PM
To: [email protected]
Subject: Re: Thinking about Mahout layout, builds, etc.
Initially, developers will be hitting bugs or bad design all over the place, so they would favor one project.  Also, with good package design, you get most of the benefits of multiple projects.
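Concretely, "good package design" here might look like the following hypothetical skeleton, where an algorithm package depends only on common and never on a sibling algorithm:

  package org.apache.mahout.bayes;

  import java.util.List;
  import org.apache.mahout.common.Classifier;  // shared code: fine
  // org.apache.mahout.svm is never imported: siblings stay independent

  public class BayesClassifier implements Classifier {
    public void train(List<double[]> features, List<String> labels) {
      // model building would go here
    }

    public String classify(double[] features) {
      return "unknown";  // placeholder
    }
  }

With that discipline, splitting into separate projects later is mostly a matter of moving directories.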
So why not start simple and migrate to complicated later?
On 1/29/08 3:15 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
Thinking about these alternatives from an Eclipse user's point of view, the original proposal would seem to encourage multiple projects (one per algorithm + a common project) while the second would encourage a single project containing multiple packages. Depending upon the amount of code that would reside in each algorithm, one or the other might be preferable.

Would a given developer typically be working on the entire library (favoring a single project) or just on one or two algorithms (favoring multiple projects)?

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008 2:43 PM
To: [email protected]
Subject: Re: Thinking about Mahout layout, builds, etc.



I think that having multiple source roots is a pain.  That is what packages are for.

I would recommend instead:

- at the top level, there should be trunk, tags, and releases, as is typical in an SVN-based project.

- below trunk and any tag or release there should be:

  docs
  lib
  src/org/apache/mahout

Below the source directory, there should be packages common, algorithmA, algorithmB, and all tests should be located near the associated source. If it is really desirable to separate tests from normal source (I have done it both ways and find having the tests nearby beneficial), then there can be a parallel tree next to src called "test".
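For example (hypothetical names, assuming JUnit), "tests near the associated source" could be as simple as:

  src/org/apache/mahout/algorithmA/AlgorithmA.java
  src/org/apache/mahout/algorithmA/AlgorithmATest.java

  // AlgorithmATest.java -- a minimal JUnit-style test beside its source
  package org.apache.mahout.algorithmA;

  import junit.framework.TestCase;

  public class AlgorithmATest extends TestCase {
    public void testSmoke() {
      assertTrue(true);  // placeholder; real assertions would go here
    }
  }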

The target of compilation should be a single jar file.


On 1/29/08 2:26 PM, "Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

I am thinking a structure like the following would be useful for getting started:

mahout/trunk/
  docs/
  common/
    src/
      main/
      test/
    docs/
    lib/
  algorithmA/
    (similar to common, but for this algorithm)
  algB/
    ...
  ...

Where algorithmA, B, etc. are the various libraries we intend to implement.  We can hold off on creating them until we have some code, but I was thinking it would be good to have the general layout in mind.

Of course, this is expandable and changeable.  What do others think?

On a related note, one of the things we discussed pre-Apache was the general sense that we shouldn't feel the need to create an all-encompassing framework. The basic gist of this is that any given library could be completely independent of the others (with maybe the exception that they share a common library). My gut says this is the way to get started, but that it may evolve over time once we have some running time together and can start to recognize synergies, such that maybe by the time we get to 1.0 of Mahout there may be more common code than we originally thought. The "common" area above can serve as the area for utilities, classes, common Hadoop extensions, etc. that are shared between the various algorithms, but I would also say let's not try to prematurely optimize across the algorithms just yet.

Anyone else have any preference on this?

-Grant


--
Mason Tang '10, Course 6-3
Address: Burton-Conner 224A        Email: [EMAIL PROTECTED]
        410 Memorial Dr.          Phone: 508-414-5811
        Cambridge, MA 02139         WWW: www.geekbyday.com



