First off, great feedback from everyone on the article!  It's sometimes funny 
what people take from something you present.  I saw one review saying I was too 
thin on collab filtering and too heavy on clustering, while the consensus here 
seems to be the opposite!  The important thing is that y'all read it!  For 
that, I am grateful.

As for Ant v. Mvn, I went back and forth on it.  In the end, I chose Ant, b/c I 
needed to build custom targets quickly that executed a variety of steps.  Mvn 
is not strong at this.  In fact, you'll see in the build file I have a section 
that updates Mahout from SVN and then invokes Maven!

The code is indeed pre-0.2.  I was, in fact, late on the article delivery b/c I 
had to do a fair amount of work in Mahout to get it up to what I deemed article 
quality (consistent Driver programs, the ability to create Vectors, etc.).  You 
should definitely be using the 0.2 release now that it is out.  There were also 
some things I had hoped to get in for the article and the release that I simply 
couldn't finish in time (MAHOUT-165 and also the log-likelihood cluster 
labeling stuff, M-163, I think).

Some more thoughts, Josh, about your specific issues below.

On Nov 20, 2009, at 2:56 PM, Patterson, Josh wrote:

> Drew,
> Yeah, I'm also very interested in clustering and the article is mainly
> focused on collaborative filtering. I've been reading through mailing
> list archives and Mahout source code such as KMeansDriver.java to get a
> better idea on how to get clustering working before I ask the simple
> "FAQ"-type questions. We'd like to be able to: 
> 
> a.) cluster
> b.) classify
> 
> our data so we can explore the data and project / classify trends. Other
> data mining tools are a little more turn-key than Mahout, but since we
> use Hadoop we believe Mahout is the best option at this point. My short
> term goal is to take known data sets such as:
> 
> http://www-stat.ucdavis.edu/~shumway/tsa.html
> and
> http://lib.stat.cmu.edu/datasets/
> 
> and get basic clustering to work on Hadoop with Mahout.

It should be fairly straightforward to get basics working.  In fact, one of the 
examples on the wiki is for the use case of control systems (albeit synthetic 
ones), but I'll defer to others for deeper insight.

> If I can get
> clustering working well and consistently on Mahout, it's a solid
> stepping stone towards working with other data sets (and any canonical
> time series data set suggestions are appreciated here). I've worked with
> timeseries data sets and machine learning before but more with neural
> nets (and others) in non-distributed situations. Since our project is
> open source, I'd love to be able to contribute back time series related
> code to future Mahout versions.

+1.  That would be great!  I know David Hall was doing some topics-over-time 
work, but I haven't seen him around lately to know for sure.  That was related 
to his LDA contribution, so it may not be applicable for you.

> Our project's home page is:
> 
> http://openpdc.codeplex.com
> 
> Some background:
> 
> http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source/
> http://news.cnet.com/8301-13846_3-10393259-62.html?tag=mncol;title
> http://earth2tech.com/2009/11/10/the-google-android-of-the-smart-grid-openpdc/
> 
> I think in terms of clustering time series data, the first step looks to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.

Yep.
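
To make the "vectorize, then k-means" idea concrete, here is a rough 
in-memory sketch.  It is plain Java, not Mahout code: double[] arrays stand 
in for DenseVector, a simple loop stands in for the Hadoop-based 
KMeansDriver, and the class name, method names, fixed window size, and 
"first k vectors" seeding are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: nothing here is part of the Mahout API.
class TimeSeriesKMeansSketch {

    // Cut one long series into non-overlapping windows of a fixed length;
    // each window becomes one dense feature vector.
    static List<double[]> vectorize(double[] series, int windowSize) {
        List<double[]> vectors = new ArrayList<>();
        for (int start = 0; start + windowSize <= series.length; start += windowSize) {
            vectors.add(Arrays.copyOfRange(series, start, start + windowSize));
        }
        return vectors;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Basic Lloyd-style k-means. Seeding with the first k vectors is an
    // assumption that keeps the sketch deterministic.
    static int[] kMeans(List<double[]> points, int k, int maxIterations) {
        int dim = points.get(0).length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points.get(c).clone();
        }
        int[] assignment = new int[points.size()];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every vector to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.size(); p++) {
                int n = nearest(points.get(p), centroids);
                if (n != assignment[p]) {
                    changed = true;
                }
                assignment[p] = n;
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.size(); p++) {
                counts[assignment[p]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[p]][i] += points.get(p)[i];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid for an empty cluster
                }
                for (int i = 0; i < dim; i++) {
                    centroids[c][i] = sums[c][i] / counts[c];
                }
            }
            if (!changed && iter > 0) {
                break; // assignments are stable
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up readings that alternate between a low and a high regime.
        double[] series = {1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 1.05, 0.95, 1.0, 9.9, 10.1, 10.0};
        List<double[]> vectors = vectorize(series, 3);
        System.out.println(Arrays.toString(kMeans(vectors, 2, 10))); // [0, 1, 0, 1]
    }
}
```

With real data, the main decisions are how to window the series and which 
distance measure to use; the clustering loop itself stays the same.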

> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or discussion
> is greatly appreciated,

I think it is pretty wide open here, so I'd suggest focusing on specific 
questions as they arise and we can help at that point.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search
