On Aug 20, 2008, at 10:10 AM, Karl Wettin wrote:
I think it would be nice to get it out ASAP, perhaps even by next
weekend? I'll get started on the HowToRelease wiki page right now.
Anything is possible, I suppose. I'll do what I can, but I am also
planning a Solr release for next week, so...
I also have a bunch of post-0.1 thoughts:
We could post a wishlist/plan for 0.2 along with the release of 0.1.
This is probably just a link to a wiki page (which does not exist yet)
where we list what people are working on that may or may not become
something. This could turn out to be a catalyst, and if nothing
else it could be used to help consolidate work taking place outside
of the fora to avoid duplicate work. Or would it be better to fill
JIRA with that sort of stuff? It would be nice if we did not end
up with a thousand old and open issues without patches. Thoughts?
I'm open to anything, but I've always found coordinating O/S projects
to be like the proverbial cat herding problem. I'd love to get
Hadoop's patch checker system in place for Mahout on Hudson; I think
this can help w/ the bad patch problem. Of course, the flip side to
the thousand old issues is the stale wiki. I don't know a good
solution, as they all rely on people to be involved and take up the
work to maintain. Or perhaps we can come up w/ a cool Mahout
application that we train on JIRA to classify issues into good,
maybe, and bad, and we automatically close/mark any issue that is
labeled as bad. :-) Might make for a cool, real-world application
that would benefit a whole ton of projects in the ASF alone. Argh,
where's that cloning machine when you need it? Just not enough hours
in the day.
Also, one way to potentially get lots of users at release is to
introduce a simple band-aid between a Lucene index and Mahout. No
need to make it as complex as MAHOUT-7; something that converts the
term vector of a document to a SparseVector using term identity as
column would be enough. Those who don't want the term vectors in
their index could use some layer that pre-analyzes a Document at
index time (and replaces the fields with the stream) and passes down
the vectors in some format that makes sense for Mahout.
I think the Bayes stuff has some of this ground work, namely the
examples use Lucene to analyze the articles and put them in the Bayes
format.
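
To make the term-vector idea above concrete, here is a minimal sketch of the mapping only. It is an illustration under stated assumptions, not Lucene's or Mahout's API: a plain term-to-frequency mapping stands in for a Lucene term vector, a dict of column index to weight stands in for a SparseVector, and `TermDictionary` and `to_sparse_vector` are hypothetical names.

```python
from collections import Counter


class TermDictionary:
    """Hypothetical helper: gives each distinct term a stable column index."""

    def __init__(self):
        self._index = {}

    def column(self, term):
        # A new term gets the next free column; a known term keeps its column.
        return self._index.setdefault(term, len(self._index))

    def __len__(self):
        return len(self._index)


def to_sparse_vector(term_freqs, dictionary):
    """Map {term: frequency} to {column: weight}, storing non-zero entries only.

    The dict returned here is a stand-in for a sparse vector type.
    """
    return {dictionary.column(t): float(f) for t, f in term_freqs.items() if f}


# Stand-ins for the term vectors of two analyzed documents.
dictionary = TermDictionary()
doc1 = Counter("the quick brown fox jumps over the lazy dog".split())
doc2 = Counter("the lazy dog sleeps".split())

vec1 = to_sparse_vector(doc1, dictionary)
vec2 = to_sparse_vector(doc2, dictionary)
```

Sharing one dictionary across documents is what makes the columns comparable: a term seen in both documents lands in the same column of both vectors, and the number of columns grows only when a new term appears.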
I for one am working on MAHOUT-19, using -61 (mbox/nntp->matrix) for
examples and trying to come up with a new take on -65 (meta data)
(as -61 can make use of that). I'm also looking closer at
cross-validation to power various feature selection schemes, but this is a
bit secondary.
Cool. Once we get the release out, I plan on building an Amazon AMI
for it and putting up docs on it, as well as running some tests
using the new NB/CNB Wikipedia stuff, and maybe also setting up an
example using DMOZ or something like that as a POC.
I would also love to get an SVM implementation in for 0.2.
On 20 Aug 2008, at 14:59, Grant Ingersoll wrote:
Hi Mahouters,
I'd like to suggest we start gearing up for a 0.1 release. Since
this is our first one, we're going to have a bit of extra work to
get things in the right shape, so any extra time you have would be
most appreciated.
First and foremost would be testing, etc., on the current trunk
(assuming SVN is up, which it doesn't appear to be right now) and
providing feedback on what's good and bad. This is especially true
of people who have access to clusters (which many of us committers
will soon have thanks to a kind donation by Amazon.)
Second, we should go through JIRA and (un)mark issues in JIRA as
either in or out of 0.1 or closed. See https://issues.apache.org/jira/browse/MAHOUT/fixforversion/12312976
Of these, MAHOUT-9, 56 and 60 are all pretty much done; they just
need a bit more documentation. MAHOUT-54 looks like it could be closed
(right, Jeff?), as the reporter hasn't responded to questions, etc.
So, if you have something you think should be in 0.1, please go
mark it as such in JIRA.
Next, we need to address https://issues.apache.org/jira/browse/MAHOUT-69,
at a minimum. One of us should look at other ASF projects
(Lucene/Solr) and grab their "How To Make a Release" documentation
(on the wiki) and put it up on our wiki. Volunteers?
After that, I'd suggest we are ready for a release. Typically, we
call a "freeze" date, and then we release a series of release
candidates. For Mahout, since we are so young and this is such an
early release, I don't think we need to obsess too much over this.
Our APIs are likely to change in the future, so we should just keep
things light: release early, release often. I volunteer to be the
release manager.
With the release ready to go, then we can go out and make some
noise, to help attract more people, etc. We can work w/ the ASF
PRC (public relations committee) on this a bit, I think.
Additionally, those of us who blog should do so. I'd also think it
would be great if anyone with Wikipedia savviness could put us on
the map there. Currently, Wikipedia Mahout is: http://en.wikipedia.org/wiki/Mahout
but I think we could make it a "disambiguation" page, or at least
add in an Apache Mahout page. Just food for thought... Our
community is actually pretty big for a new project, or at least the
number of lurkers is pretty big. I think a number of people are in
"wait and see" mode, so we (i.e. committers and active
contributors) need to get over the hump a bit so that others will
feel more comfortable joining in. An official release should help
with that, but do let us know if you have other ideas as well.
Timewise, I'd love it if we could have the release out within the
month, but of course, I know we are all busy. That being said,
we've got a lot of goodness in our repo now, what w/ Taste,
Clustering, the GA stuff and the Naive Bayes stuff (kudos to our
two active GSoC students, Deneche and Robin!)
Cheers,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ