Re: Thinking about Mahout layout, builds, etc.

Ken Montanez Tue, 29 Jan 2008 23:18:39 -0800

Ian, it looks like you have some experience with Maven. I have heard a lot
of good things about it from many people on many different projects, but no
personal experience to speak of - can you vouch for it? What do you like
about it? Is there anything that you don't like/want to get away from?


Ken

On Jan 29, 2008 10:41 PM, Ian Holsman <[EMAIL PROTECTED]> wrote:

> Hi Guys,
>
> my 2c's
>
> Grant Ingersoll wrote:
> > But would people prefer getting the jars separately?  I do think there
> > is some common housekeeping code, but I also don't want to
> > overemphasize it when it comes to developing an individual algorithm.
> > In other words, if a bayes classifier and an SVM implementation could
> > share a common framework, but it would end up being really confusing,
> > versus them each being more or less cleanly separated and logical, I
> > think I would favor the separated.   By the same token, if they can
> > work beautifully together, then, that would argue for more common code.
> >
> are we planning on making separate releases for the code?
> does having these bundled together somehow impact the performance or
> functionality of other algorithms?
> would the combined size of the jar be less than 2-3m?
>
> if the answer to all of these is no, we should just have a single jar.
> Size should not be an issue here, development/operational speed should be.
> It is much easier to manage a single jar operationally IMHO.
>
>
> > As for Hadoop and HBase, that is just two potential libraries.  We are
> > potentially talking 10+.  Would you want to download a huge jar that
> > contains everything when all you want is a single algorithm?  Granted,
> > that can be done from one source tree, but I wonder if that makes it
> > harder.
> yep. but in these days of maven and the like I have no idea how many
> jars I'm actually downloading, it just does it.
>
> >
> > But, I do take away that we probably should just start simple and not
> > worry about a complex build just yet.  I think it is safe to say that
> > up through our first official release we can feel free to change
> > things around if we have to.
> >
>
> yep..
> >
> > -Grant
> >
> >
> > On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:
> >
> >> +1
> >>
> >> Not going to repeat the same arguments, but one other thing is that
> >> almost all of the algorithms are going to (or at least should) share
> >> some common housekeeping code, the main chunk of which will probably
> >> be IO.  Functionally, I don't think an individual algorithm is
> >> significant enough to warrant its own project, and many of them might
> >> wind up sharing common interfaces.
> >>
> >> ~ Mason
> >>
> >> Jeff Eastman wrote:
> >>> +1
> >>> A single project facilitates refactoring and promotes consistency of
> >>> design. If there's not enough code in Hadoop+Hbase to justify multiple
> >>> projects it would be premature abstraction to organize Mahout that way.
> >>> Let's keep it simple...
> >>> Jeff
> >>> -----Original Message-----
> >>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> >>> Sent: Tuesday, January 29, 2008 4:21 PM
> >>> To: [email protected]
> >>> Subject: Re: Thinking about Mahout layout, builds, etc.
> >>> Initially, developers will be hitting bugs or bad design all over the
> >>> place, so they would favor one project.  Also, with good package
> >>> design, you get most of the benefits of multiple projects.
> >>> So why not start simple and migrate to complicated later?
> >>> On 1/29/08 3:15 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
> >>>> Thinking about these alternatives from an Eclipse user's point of
> >>>> view, the original proposal would seem to encourage multiple projects
> >>>> (one per algorithm + a common project) while the second would
> >>>> encourage a single project containing multiple packages. Depending
> >>>> upon the amount of code that would reside in each algorithm, one or
> >>>> the other might be preferable.
> >>>>
> >>>> Would a given developer typically be working on the entire library
> >>>> (single project favoring) or just on one or two algorithms (multiple
> >>>> project favoring)?
> >>>>
> >>>> Jeff
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> >>>> Sent: Tuesday, January 29, 2008 2:43 PM
> >>>> To: [email protected]
> >>>> Subject: Re: Thinking about Mahout layout, builds, etc.
> >>>>
> >>>>
> >>>>
> >>>> I think that having multiple source roots is a pain.  That is what
> >>>> packages are for.
> >>>>
> >>>> I would recommend instead:
> >>>>
> >>>> - at the top level, there should be trunk, tags, releases as is
> >>>>   typical in an SVN based project.
> >>>>
> >>>> - below trunk and any tag or release there should be:
> >>>>
> >>>>   docs
> >>>>   lib
> >>>>   src/org/apache/mahout
> >>>>
> >>>> Below the source directory, there should be packages common,
> >>>> algorithmA, algorithmB and all tests should be located near the
> >>>> associated source. If it is really desirable to separate tests from
> >>>> normal source (I have done it both ways and find having the tests
> >>>> nearby beneficial), then there can be a parallel tree next to src
> >>>> called "test".
> >>>>
> >>>> The target of compilation should be a single jar file.
> >>>>
> >>>>
> >>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> I am thinking a structure like the following would be useful for
> >>>>> getting started:
> >>>>> mahout/trunk/
> >>>>>   docs
> >>>>>   common/
> >>>>>     src/
> >>>>>       main/
> >>>>>       test/
> >>>>>     docs/
> >>>>>     lib/
> >>>>>   algorithmA/
> >>>>>     Similar to common, but for this algorithm
> >>>>>   algB
> >>>>>     ...
> >>>>>   ...
> >>>>>
> >>>>> Where algorithmA, B, etc. are the various libraries we intend to
> >>>>> implement.  We can hold off on creating them until we have some
> >>>>> code, but was thinking it would be good to have the general layout
> >>>>> in mind.
> >>>>>
> >>>>> Of course, this is expandable and changeable.  What do others think?
> >>>>>
> >>>>> On a related note, one of the things we discussed pre-Apache, was
> >>>>> the general sense that we shouldn't feel the need to create an all
> >>>>> encompassing framework.  The basic gist of this being that any given
> >>>>> library could be completely independent of the others (with maybe
> >>>>> the exception that they share a common library).  My gut says this
> >>>>> is the way to get started, but that it may evolve over time once we
> >>>>> have some running time together and can start to recognize
> >>>>> synergies, such that maybe by the time we get to 1.0 of Mahout there
> >>>>> may be more common code than we originally thought.  The "common"
> >>>>> area above can serve as the area for utilities, classes, common
> >>>>> Hadoop extensions, etc. that are shared between the various
> >>>>> algorithms, but I would also say let's not try to prematurely
> >>>>> optimize across the algorithms just yet.
> >>>>>
> >>>>> Anyone else have any preference on this?
> >>>>>
> >>>>> -Grant
> >>>>>
> >>
> >> --
> >> Mason Tang '10, Course 6-3
> >> Address: Burton-Conner 224A        Email: [EMAIL PROTECTED]
> >>         410 Memorial Dr.          Phone: 508-414-5811
> >>         Cambridge, MA 02139         WWW: www.geekbyday.com
> >
> >
> >
>
>


-- 
Ken Montanez | 510.681.5576
