Drew,

Yeah, I'm also very interested in clustering, and the article is mainly focused on collaborative filtering. I've been reading through the mailing list archives and Mahout source code such as KMeansDriver.java to get a better idea of how to get clustering working before I ask the simple "FAQ"-type questions. We'd like to be able to:
a.) cluster
b.) classify

our data so we can explore it and project / classify trends. Other data mining tools are a little more turn-key than Mahout, but since we use Hadoop we believe Mahout is the best option at this point.

My short term goal is to take known data sets such as:

http://www-stat.ucdavis.edu/~shumway/tsa.html

and

http://lib.stat.cmu.edu/datasets/

and get basic clustering to work on Hadoop with Mahout. If I can get clustering working well and consistently on Mahout, it's a solid stepping stone towards working with other data sets (and any canonical time series data set suggestions are appreciated here). I've worked with time series data sets and machine learning before, but more with neural nets (and others) in non-distributed situations.

Since our project is open source, I'd love to be able to contribute time series related code back to future Mahout versions. Our project's home page is:

http://openpdc.codeplex.com

Some background:

http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source/

http://news.cnet.com/8301-13846_3-10393259-62.html?tag=mncol;title

http://earth2tech.com/2009/11/10/the-google-android-of-the-smart-grid-openpdc/

I think in terms of clustering time series data, the first step looks to be vectorizing the input cases, possibly with the DenseVector class, and feeding that to a basic KMeans implementation like KMeansDriver.java. Once we can get basic k-means rolling with some known data set, we'll be able to iterate on that and move towards using more complex techniques and other grid time series data.

Any suggestions or discussion is greatly appreciated,

Josh Patterson
TVA

-----Original Message-----
From: Drew Farris [mailto:[email protected]]
Sent: Friday, November 20, 2009 12:04 PM
To: [email protected]
Subject: Re: mahout examples

Hi Josh,

I got started with Mahout using Grant's article as an introduction -- the code included with it is older than the 0.2 release.
Ant is used primarily to build Grant's examples and as a launcher -- it sets up the classpath so that the examples can be run properly. The build script uses the pre-built version of Mahout shipped with the bundle.

I found that the article itself and the sample code provided were a great way to get started, but I quickly started working from the Subversion repository once I had achieved the basics. I was mostly interested in clustering, and that wasn't covered in the article to a great extent anyway. The Mahout wiki is a great resource for exploring further.

For what it's worth, the Mahout sources included in Grant's article and the sources obtainable as part of the release or from Subversion must all be built using Maven.

Hope this helps,
Drew

On Fri, Nov 20, 2009 at 10:16 AM, Patterson, Josh <[email protected]> wrote:
> I've noticed that the article at:
>
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
> uses Ant while release of Mahout 0.2 uses Maven. Also, the article's
> included downloadable code includes what looks like a snapshot version
> of Mahout 0.2 - is that essentially a copy of the release? Are there any
> other differences I should note when working through these examples?
> Thanks!
>
> Josh Patterson
>
> TVA
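
As a rough illustration of the plan Josh describes in his reply (vectorize each time series case, then feed the vectors to a basic k-means), here is a minimal single-machine sketch in plain Java. It deliberately does not call Mahout: double[] arrays stand in for DenseVector, and the in-memory loop stands in for whatever KMeansDriver would run on Hadoop. The class and method names (TimeSeriesKMeansSketch, window, cluster), the choice of fixed-length non-overlapping windows as the vectorization, and seeding with the first k vectors are all assumptions made for illustration, not anything taken from Mahout or from the thread.

```java
import java.util.Arrays;

/** Illustration only: plain double[] arrays stand in for Mahout's DenseVector. */
final class TimeSeriesKMeansSketch {

    /** Cuts a series into non-overlapping windows; a trailing partial window is dropped. */
    static double[][] window(double[] series, int length) {
        int count = series.length / length;
        double[][] vectors = new double[count][];
        for (int i = 0; i < count; i++) {
            vectors[i] = Arrays.copyOfRange(series, i * length, (i + 1) * length);
        }
        return vectors;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to v. */
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(v, centroids[c]) < squaredDistance(v, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Basic (Lloyd's) k-means. Assumes 1 <= k <= vectors.length and seeds the
     * centroids with the first k vectors. Returns the cluster index of each vector.
     */
    static int[] cluster(double[][] vectors, int k, int maxIterations) {
        int dim = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = vectors[c].clone();
        }
        int[] assignment = new int[vectors.length];
        Arrays.fill(assignment, -1); // forces the first pass to count as a change

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every vector to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < vectors.length; i++) {
                int n = nearest(vectors[i], centroids);
                if (n != assignment[i]) {
                    assignment[i] = n;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no vector moved to a different cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < vectors.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += vectors[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

For example, window(series, 3) on a 12-sample series yields four 3-dimensional vectors, and cluster(vectors, 2, 10) then returns one cluster index per window. Moving from this to Mahout would mostly mean replacing the arrays with Mahout vectors and writing them out in whatever input format the Hadoop job expects, which this sketch does not attempt to show.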
