Re: 0.1 Planning

Grant Ingersoll Thu, 21 Aug 2008 08:07:20 -0700


On Aug 20, 2008, at 10:10 AM, Karl Wettin wrote:

I think it would be nice to get it out ASAP, perhaps even by next weekend? I'll get started on the HowToRelease wiki page right now.

Anything is possible, I suppose. I'll do what I can, but I am also planning a Solr release for next week, so...

I've also got a bunch of post-0.1 thoughts:
We could post a wishlist/plan for 0.2 in the release of 0.1. This is probably just a link to a currently non-existent wiki page where we list what people are working on that may or may not become something. This could turn out to be a catalyst, and if nothing else it could be used to help consolidate work taking place outside of the fora to avoid duplicate work. Or is it better if we filled JIRA with that sort of stuff? It would be nice if we did not end up with a thousand old and open issues without patches. Or?

I'm open to anything, but I've always found coordinating O/S projects to be like the proverbial cat herding problem. I'd love to get Hadoop's Patch checker system in place for Mahout on Hudson, I think this can help w/ the bad patch problem. Of course, the flip side to the thousand old issues is the stale wiki. I don't know a good solution, as they all rely on people to be involved and take up the work to maintain. Or perhaps, we can come up w/ a cool Mahout application that we train on JIRA to classify issues into: good, maybe, and bad and we automatically close/mark any issue that is labeled as bad. :-) Might make for a cool, real world application that would benefit a whole ton of projects in the ASF alone. Argh, where's that cloning machine when you need it? Just not enough hours in the day.

Also, one way to potentially get lots of users at release is to introduce a simple band-aid between a Lucene index and Mahout. No need to make it as complex as MAHOUT-7; something that converts the term vector of a document to a SparseVector using term identity as column would be enough. Those who don't want the term vectors in their index could use some layer that pre-analyzes a Document at index time (and replaces the fields with the stream) and passes down the vectors in some format that makes sense for Mahout.

I think the Bayes stuff has some of this ground work, namely the examples use Lucene to analyze the articles and put them in the Bayes format.
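
To make the term-vector idea above concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions, not Mahout or Lucene code: the class TermVectorSketch, its methods, and the use of a Map<Integer, Double> in place of Mahout's SparseVector are all hypothetical, and the parallel terms/frequencies arrays merely stand in for whatever a Lucene term vector would supply. Each distinct term gets a column index from a shared dictionary, so the same term lands in the same column for every document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Mahout or Lucene.
public class TermVectorSketch {

    // Returns the column index for a term, assigning the next free
    // index the first time the term is seen ("term identity as column").
    public static int columnFor(Map<String, Integer> dictionary, String term) {
        Integer column = dictionary.get(term);
        if (column == null) {
            column = dictionary.size();
            dictionary.put(term, column);
        }
        return column;
    }

    // Converts one document's terms and term frequencies into a sparse
    // vector, represented here as column index -> value. A TreeMap keeps
    // the columns sorted, which makes the output easy to read.
    public static Map<Integer, Double> toSparseVector(
            String[] terms, int[] freqs, Map<String, Integer> dictionary) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (int i = 0; i < terms.length; i++) {
            // Sum frequencies in case a term appears more than once in the input.
            vector.merge(columnFor(dictionary, terms[i]), (double) freqs[i], Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        // The dictionary is shared across documents so columns line up.
        Map<String, Integer> dictionary = new HashMap<>();
        Map<Integer, Double> doc1 = toSparseVector(
                new String[] {"apache", "mahout"}, new int[] {2, 1}, dictionary);
        Map<Integer, Double> doc2 = toSparseVector(
                new String[] {"mahout", "lucene"}, new int[] {3, 1}, dictionary);
        System.out.println(doc1); // {0=2.0, 1=1.0}
        System.out.println(doc2); // {1=3.0, 2=1.0}
    }
}
```

A real bridge would presumably read the terms and frequencies from the index and write into Mahout's own vector type; the sketch only shows the column-assignment step.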

I, for one, am working on MAHOUT-19, using -61 (mbox/nntp->matrix) for examples and trying to come up with a new take on -65 (meta data) (as -61 can make use of that). I'm also looking closer at cross-fold validation to power various feature selection schemes, but this is a bit secondary.

Cool. Once we get the release out, I plan on building an Amazon AMI for it and putting up docs on it, as well as starting to do some tests using the new NB/CNB Wikipedia stuff, and maybe also setting up an example using DMOZ or something like that as a POC.


I would also love to get an SVM implementation in for 0.2.

On 20 Aug 2008, at 14:59, Grant Ingersoll wrote:
Hi Mahouters,
I'd like to suggest we start gearing up for a 0.1 release. Since this is our first one, we're going to have a bit of extra work to get things in the right shape, so any extra time you have would be most appreciated.
First and foremost would be testing, etc. on the current trunk (assuming SVN is up, which it doesn't appear to be right now) and providing feedback on what's good and bad. This is especially true of people who have access to clusters (which many of us committers will soon have thanks to a kind donation by Amazon.)
Second, we should go through JIRA and (un)mark issues in JIRA as either in or out of 0.1 or closed. See https://issues.apache.org/jira/browse/MAHOUT/fixforversion/12312976 Of these, MAHOUT-9, 56 and 60 are all pretty much done, they just need a bit more documentation. M-54 looks like it could be closed, right Jeff, as the reporter hasn't responded to questions, etc.? So, if you have something you think should be in 0.1, please go mark it as such in JIRA.
Next, we need to address https://issues.apache.org/jira/browse/MAHOUT-69, at a minimum. One of us should look at other ASF projects (Lucene/Solr) and grab their "How To Make a Release" documentation (on the wiki) and put it up on our wiki. Volunteers?
After that, I'd suggest we are ready for a release. Typically, we call a "freeze" date, and then we release a series of release candidates. For Mahout, since we are so young and this is such an early release, I don't think we need to obsess too much over this. Our APIs are likely to change in the future, so we should just keep things light: release early, release often. I volunteer to be the release manager.
With the release ready to go, then we can go out and make some noise, to help attract more people, etc. We can work w/ the ASF PRC (public relations committee) on this a bit, I think. Additionally, those of us who blog should do so. I'd also think it would be great if anyone with Wikipedia savviness could put us on the map there. Currently, Wikipedia Mahout is: http://en.wikipedia.org/wiki/Mahout but I think we could make it a "disambiguation" page, or at least add in an Apache Mahout page. Just food for thought... Our community is actually pretty big for a new project, or at least the number of lurkers is pretty big. I think a number of people are in "wait and see" mode, so we (i.e. committers and active contributors) need to get over the hump a bit so that others will feel more comfortable joining in. An official release should help with that, but do let us know if you have other ideas as well.
Time wise, I'd love it if we could have the release out within the month, but of course, I know we are all busy. That being said, we've got a lot of goodness in our repo now, what w/ Taste, Clustering, the GA stuff and the Naive Bayes stuff (kudos to our two active GSOC students Deneche and Robin!)
Cheers,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
