Running the latest trunk, I get a file not found exception when running
synthetic control on the $output/data file. It looks like the output got
deleted somewhere, but I have not discovered where yet. Perhaps Canopy is
broken, or KMeans is purging its output?
Grant Ingersoll wrote:
I'm running trunk, using the data at
http://people.apache.org/wikipedia/n2.tar.gz (a dump of 2302 documents
from a Lucene index of Wikipedia; the chunks file in that same
directory contains the original files). Vectors are normalized using the L2 norm.
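
For readers unfamiliar with the term, L2 normalization here means dividing each vector by its Euclidean length so that it has unit norm. The snippet below is only a minimal sketch of that step; the class L2Normalizer and its method are hypothetical names for illustration, and this is not the code used to produce the n2 data nor part of Mahout.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class L2Normalizer {
    // Returns a copy of v scaled to unit Euclidean (L2) length.
    // An all-zero vector has no direction, so it is returned unchanged.
    static double[] normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) {
            return v.clone();
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }
}
```

For example, L2Normalizer.normalize(new double[]{3.0, 4.0}) yields a vector of length 1 pointing in the same direction as the input.
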
When I run K-Means on it via:
org.apache.mahout.clustering.kmeans.KMeansDriver --input
/Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/part-full.txt
--clusters
/Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/clusters
--k 10 --output
/Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/k-output
--distance org.apache.mahout.utils.CosineDistanceMeasure
I get the two directories seen in n2-output. The clusters-0 and
clusters-1 files both contain a single vector that is all zeros.
I've also tried SquaredEuclidean, but to no avail.
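
One reason an all-zero vector is worth flagging: with the textbook definition of cosine distance, 1 - (a . b) / (|a| |b|), a zero vector makes the denominator zero and the result undefined. The sketch below shows that behavior in plain Java. CosineCheck is a hypothetical class written for illustration; it is not Mahout's CosineDistanceMeasure, and whether Mahout treats zero vectors the same way is an assumption that would need checking against the trunk source.

```java
// Hypothetical illustration of textbook cosine distance; not Mahout's implementation.
final class CosineCheck {
    // Cosine distance: 1 - dot(a, b) / (|a| * |b|).
    // If either vector is all zeros, this computes 0.0 / 0.0 and returns NaN.
    static double cosineDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normASq = 0.0;
        double normBSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normASq += a[i] * a[i];
            normBSq += b[i] * b[i];
        }
        double denom = Math.sqrt(normASq) * Math.sqrt(normBSq);
        return 1.0 - dot / denom;
    }

    // True when every component is exactly 0.0, which is what the
    // clusters-0 and clusters-1 vectors are described as containing.
    static boolean isAllZero(double[] v) {
        for (double x : v) {
            if (x != 0.0) {
                return false;
            }
        }
        return true;
    }
}
```

Running isAllZero over the parsed input vectors before clustering would at least tell whether the zeros are already present in the input or only appear in the output.
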
Any insight into what I'm doing wrong would be appreciated.
Thanks,
Grant