On Aug 5, 2009, at 7:52 PM, Allan Roberto Avendano Sudario wrote:

2009/8/5 Grant Ingersoll <[email protected]>

What parameters did you use in the command line?


I'm running syntheticcontrol kmeans clustering. Three parameters are needed:
two thresholds and one convergence criterion for the iterations.

Which values are recommended for each one?

Synthetic Control is just an example data set. For generic clustering, use the KMeansDriver. AFAICT, setting those values is done by trial and error, but others may have more insight.
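To make the convergence criterion concrete, here is a minimal sketch in plain Python. It is not Mahout's KMeansDriver; the function name kmeans and the parameter names convergence_delta and max_iterations are made up for illustration. It assumes the criterion means "stop once no centroid moves farther than the delta between two iterations", and it does not cover the two thresholds at all.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, convergence_delta, max_iterations=10):
    """Illustrative k-means (hypothetical helper, not a Mahout API).

    Returns (centroids, iterations_run). Iteration stops as soon as the
    largest centroid movement is <= convergence_delta.
    """
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centroids.append(old)  # leave empty clusters in place

        # Convergence check: how far did the most-moved centroid travel?
        shift = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            break
    return centroids, iteration


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    seeds = [(0.0, 0.0), (10.0, 10.0)]
    for delta in (0.5, 0.01):
        result, runs = kmeans(pts, seeds, convergence_delta=delta)
        print(delta, runs, result)
```

A looser delta ends the run after fewer passes, while a tighter one costs extra iterations for a smaller change in the centroids. That trade-off is why trial and error on your own data is the usual way to pick it.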





There are a couple of threads in the archives that are likely of interest along these lines:
http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user

Are you trying to cluster text?  Or something else?


Yes, I'm trying to cluster text. I've built a tf-idf matrix composed of sparse vectors. Does syntheticcontrol kmeans clustering work well with sparse vectors?

KMeans works fine w/ Sparse, although you might want to wait for MAHOUT-121 to be resolved, as it has a pretty significant speedup. Should be done in a few days. Either that, or try the patch that is already there.

As I understand it, you will need to match up your L-norm with your distance measure to some extent, but see the archive thread: http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
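One way to see why the norm and the distance measure should match is the sketch below, which uses plain Python dicts as stand-ins for sparse tf-idf vectors. None of this is Mahout code, and the helper names and sample terms are hypothetical. It relies on the identity that for L2-normalized vectors the squared Euclidean distance equals 2 * (1 - cosine similarity), so Euclidean distance on L2-normalized vectors ranks documents the same way cosine distance does.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length by construction."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance over the union of non-zero terms."""
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


if __name__ == "__main__":
    # Made-up tf-idf weights for two short documents.
    doc_a = {"mahout": 3.0, "cluster": 4.0}
    doc_b = {"cluster": 1.0, "text": 2.0}
    # Same term proportions as doc_a, but a ten times longer document.
    doc_a_long = {t: 10.0 * w for t, w in doc_a.items()}

    print(squared_euclidean(doc_a, doc_a_long))    # large: length dominates
    print(squared_euclidean(l2_normalize(doc_a),
                            l2_normalize(doc_a_long)))  # near zero
    print(squared_euclidean(l2_normalize(doc_a), l2_normalize(doc_b)),
          2.0 * cosine_distance(doc_a, doc_b))     # these two agree
```

Without normalization, a Euclidean measure mostly separates long documents from short ones. With L2 normalization it behaves like cosine distance, which is usually what you want for text.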

http://cwiki.apache.org/MAHOUT/clusteringyourdata.html has some information, but needs to be filled in more.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
