Drew,
Yeah, I'm also very interested in clustering, and the article focuses
mainly on collaborative filtering. I've been reading through mailing
list archives and Mahout source code such as KMeansDriver.java to get a
better idea of how to get clustering working before I ask the simple
"FAQ"-type questions. We'd like to be able to:

a.) cluster
b.) classify

our data so we can explore the data and project / classify trends. Other
data mining tools are a little more turn-key than Mahout, but since we
use Hadoop, we believe Mahout is the best option at this point. My
short-term goal is to take known data sets such as:

http://www-stat.ucdavis.edu/~shumway/tsa.html
and
http://lib.stat.cmu.edu/datasets/

and get basic clustering to work on Hadoop with Mahout. If I can get
clustering working well and consistently on Mahout, it's a solid
stepping stone towards working with other data sets (and any canonical
time series data set suggestions are appreciated here). I've worked with
time series data sets and machine learning before, but mostly with neural
nets (and other methods) in non-distributed settings. Since our project is
open source, I'd love to be able to contribute back time series related
code to future Mahout versions. Our project's home page is:

http://openpdc.codeplex.com

Some background:

http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source/
http://news.cnet.com/8301-13846_3-10393259-62.html?tag=mncol;title
http://earth2tech.com/2009/11/10/the-google-android-of-the-smart-grid-openpdc/

I think, in terms of clustering time series data, the first step looks to
be vectorizing the input cases, possibly with the DenseVector class, and
feeding that to a basic k-means implementation like KMeansDriver.java.
Once we can get basic k-means rolling with some known data set, we'll
be able to iterate on that and move towards more complex techniques
and other grid time series data. Any suggestions or discussion
is greatly appreciated,

Josh Patterson
TVA
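
P.S. A minimal sketch of the vectorize-then-cluster idea above, in plain
Java with no Mahout or Hadoop dependency. It is only an illustration
under stated assumptions: the class name, the method names, the
sliding-window vectorization, and the evenly spaced seeding are all
hypothetical choices made for this example, not Mahout's API. In Mahout
the windows would presumably be wrapped in DenseVector instances and
clustered by KMeansDriver; those calls are left out here.

```java
import java.util.Arrays;

/** Plain-Java sketch: window a time series into vectors, then run k-means (Lloyd's algorithm). */
class TimeSeriesKMeansSketch {

    /** Turns a series into overlapping fixed-width windows, one vector per start index. */
    static double[][] slidingWindows(double[] series, int width) {
        if (width <= 0 || width > series.length) {
            throw new IllegalArgumentException("width must be in 1..series.length");
        }
        double[][] windows = new double[series.length - width + 1][];
        for (int i = 0; i < windows.length; i++) {
            windows[i] = Arrays.copyOfRange(series, i, i + width);
        }
        return windows;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the point (ties go to the lower index). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means and returns the cluster index assigned to each point. */
    static int[] kMeans(double[][] points, int k, int maxIterations) {
        if (k <= 0 || k > points.length || maxIterations < 1) {
            throw new IllegalArgumentException("need 1 <= k <= points.length and maxIterations >= 1");
        }
        int dim = points[0].length;
        // Assumption: seed centroids with evenly spaced input points (a naive choice).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c * points.length / k].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int n = nearest(points[p], centroids);
                if (n != assignment[p]) {
                    assignment[p] = n;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] series = {0, 0, 0, 0, 10, 10, 10, 10};
        double[][] windows = slidingWindows(series, 2);
        System.out.println(Arrays.toString(kMeans(windows, 2, 20)));
    }
}
```

The window width and the number of clusters are free parameters here;
for real grid data they would need tuning, and a distributed run would
replace the in-memory loops with the Hadoop jobs that Mahout provides.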

-----Original Message-----
From: Drew Farris [mailto:[email protected]] 
Sent: Friday, November 20, 2009 12:04 PM
To: [email protected]
Subject: Re: mahout examples

Hi Josh,

I got started with Mahout using Grant's article as an introduction --
the code included with it is older than the 0.2 release.

Ant is used primarily to build Grant's examples and as a launcher --
it sets up the classpath so that the examples can be run properly. The
build script uses the pre-built version of Mahout shipped with the
bundle.

I found that the article itself and the sample code provided were a
great way to get started, but I quickly started working from the
Subversion repository once I had achieved the basics. I was mostly
interested in clustering and that wasn't covered in the article to a
great extent anyway. The Mahout wiki is a great resource for exploring
further.

For what it's worth, the Mahout sources included in Grant's article and
the sources obtainable as part of the release or from Subversion must
all be built using Maven.

Hope this helps,

Drew

On Fri, Nov 20, 2009 at 10:16 AM, Patterson, Josh <[email protected]>
wrote:
> I've noticed that the article at:
>
>
>
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
>
>
> uses Ant while the release of Mahout 0.2 uses Maven. Also, the article's
> included downloadable code includes what looks like a snapshot version
> of Mahout 0.2 - is that essentially a copy of the release? Are there any
> other differences I should note when working through these examples?
> Thanks!
>
>
>
> Josh Patterson
>
> TVA
>
>
