On Aug 5, 2009, at 7:52 PM, Allan Roberto Avendano Sudario wrote:
2009/8/5 Grant Ingersoll <[email protected]>
What parameters did you use in the command line?
I'm running the syntheticcontrol kmeans clustering example. It needs
three parameters: two thresholds and one convergence criterion for the
iterations. Which values are recommended for each one?
Synthetic Control is just an example data set. For generic
clustering, use the KMeansDriver. AFAICT, setting those values is
done by trial and error, but others may have more insight.
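To make the convergence criterion concrete, here is a minimal plain-Python sketch of a k-means loop. It is not Mahout's KMeansDriver: the names kmeans, convergence_delta and max_iterations are made up for illustration, the initial centroids are simply sampled at random, and the two thresholds from the question are not modelled at all.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, convergence_delta=0.01, max_iterations=20, seed=0):
    """Toy k-means. Stops when no centroid moves more than
    convergence_delta, or after max_iterations passes.
    All parameter names here are illustrative assumptions."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                )
                new_centroids.append(mean)
            else:
                # Keep the old centroid if the cluster is empty.
                new_centroids.append(centroids[i])
        # Convergence check: largest centroid movement in this pass.
        shift = max(euclidean(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            break
    return centroids, iteration
```

A smaller convergence_delta means more passes before the loop stops, which is why the value tends to be tuned by trying a few settings on the actual data.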
There are a couple of threads in the archives that are likely of
interest along these lines:
http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
Are you trying to cluster text? Or something else?
Yes, I'm trying to cluster text. I've built a tf-idf matrix composed
of sparse vectors. Does syntheticcontrol kmeans clustering work well
with sparse vectors?
KMeans works fine w/ Sparse, although you might want to wait for
MAHOUT-121 to be resolved, as it has a pretty significant speedup.
Should be done in a few days. Either that, or try the patch that is
already there.
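For readers unfamiliar with the setup being discussed, this is a rough plain-Python sketch of tf-idf sparse vectors and a distance that only touches the non-zero entries. It does not use Mahout's vector classes. The function names are hypothetical, and the weighting (raw term count times log(N/df)) is just one common tf-idf variant, assumed here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists. Returns one sparse vector per
    document as a {term: weight} dict (assumed weighting:
    raw count * log(n_docs / document_frequency))."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for term, count in tf.items():
            if df[term] < n_docs:  # terms in every document get weight 0; skip them
                vec[term] = count * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors


def sparse_euclidean(a, b):
    """Euclidean distance between two sparse dict vectors.
    Only terms present in at least one vector are visited."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))
```

The cost of each distance computation depends on the number of non-zero terms rather than on the vocabulary size, which is the main reason sparse representations suit text.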
As I understand it, you will need to match up your L-norm with your
distance measure to some extent, but see the archive thread:
http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
http://cwiki.apache.org/MAHOUT/clusteringyourdata.html has some
information, but needs to be filled in more.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/