On Jun 26, 2009, at 7:45 PM, Jeff Eastman wrote:
Found the call in the syntheticcontrol/kmeans.Job had true for the
overwrite output flag. Don't think that was your problem, but
something similar must be at work.
Oh, duh, that true flag is after canopy, but before kmeans. D'oh.
Good catch.
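The ordering problem described above can be reproduced outside Mahout with a minimal sketch. Everything in it is hypothetical: `run_canopy`, `run_kmeans`, and the `overwrite` parameter are stand-ins, not the actual syntheticcontrol Job code. It only shows how an overwrite flag applied after the first stage but before the second can delete the first stage's output.

```python
import shutil
import tempfile
from pathlib import Path


def run_canopy(output: Path) -> Path:
    """Hypothetical first stage: writes initial clusters under output/canopies."""
    canopies = output / "canopies"
    canopies.mkdir(parents=True, exist_ok=True)
    (canopies / "part-00000").write_text("canopy centers\n")
    return canopies


def run_kmeans(clusters_in: Path, output: Path, overwrite: bool) -> Path:
    """Hypothetical second stage: optionally clears output, then reads clusters_in."""
    if overwrite and output.exists():
        # This also deletes clusters_in when it lives under output.
        shutil.rmtree(output)
    if not clusters_in.exists():
        raise FileNotFoundError(f"missing initial clusters: {clusters_in}")
    result = output / "clusters-0"
    result.mkdir(parents=True, exist_ok=True)
    return result


def pipeline(overwrite_before_kmeans: bool) -> str:
    """Runs both stages in a fresh temp directory and reports the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp) / "output"
        canopies = run_canopy(output)
        try:
            run_kmeans(canopies, output, overwrite=overwrite_before_kmeans)
        except FileNotFoundError:
            return "file not found"
        return "ok"
```

In this sketch the failure disappears if the overwrite happens once, before the first stage, or if the second stage leaves existing output alone.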
Jeff Eastman wrote:
Running the latest trunk, I get a file not found exception running
synthetic control on the $output/data file. Looks like output got
deleted somewhere but have not discovered where yet. Perhaps Canopy
is broken or KMeans is purging output?
Grant Ingersoll wrote:
I'm running trunk, using the data at http://people.apache.org/wikipedia/n2.tar.gz
(a dump of 2302 documents from a Lucene index of Wikipedia; the
chunks file in that same directory contains the original files).
Vectors are normalized using L2.
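For reference, a minimal sketch of what L2 normalization and cosine distance compute, written over plain Python lists. The function names are illustrative and this is not Mahout's implementation. Returning 1.0 for a zero-length vector is an assumption made here, since the cosine is undefined in that case.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_distance(a, b):
    """1 - cos(angle between a and b); 1.0 if either vector has zero length (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so the two measures rank points identically on L2-normalized input.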
When I run K-Means on it via:
org.apache.mahout.clustering.kmeans.KMeansDriver
  --input /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/part-full.txt
  --clusters /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/clusters
  --k 10
  --output /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/k-output
  --distance org.apache.mahout.utils.CosineDistanceMeasure
I get the two directories seen in n2-output. The clusters-0 and
clusters-1 files both contain a single vector which is all 0.
I've also tried SquaredEuclidean, but to no avail.
Any insight into what I'm doing wrong would be appreciated.
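One way to narrow this down, sketched under assumptions: if the input vectors are non-zero, L2-normalized, and have non-negative components (as term weights usually do), their mean cannot be all zero. An all-zero centroid would then suggest that the vectors reaching the clusterer are empty or zero. The helpers below are hypothetical and operate on plain Python lists, not Mahout's vector classes.

```python
def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("no vectors to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def count_zero_vectors(vectors):
    """Number of vectors whose components are all exactly 0."""
    return sum(1 for v in vectors if all(x == 0 for x in v))
```

Running `count_zero_vectors` over the parsed input before clustering would show whether the zeros come from the data itself or from a later step.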
Thanks,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/