We need to handle that separately from the various jobs, then. That was one of the things that was different about the KMeansJob call.

On Jun 26, 2009, at 7:45 PM, Jeff Eastman wrote:

Found that the call in syntheticcontrol/kmeans.Job had true for the overwrite output flag. I don't think that was your problem, but something similar must be at work.



Jeff Eastman wrote:
Running the latest trunk, I get a file not found exception running synthetic control on the $output/data file. Looks like output got deleted somewhere but have not discovered where yet. Perhaps Canopy is broken or KMeans is purging output?


Grant Ingersoll wrote:
I'm running trunk, using the data at http://people.apache.org/wikipedia/n2.tar.gz (a dump of 2302 documents from a Lucene index of Wikipedia; the chunks file in that same directory contains the original files). Vectors are normalized using L2.

When I run K-Means on it via:

org.apache.mahout.clustering.kmeans.KMeansDriver --input /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/part-full.txt --clusters /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/clusters --k 10 --output /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/k-output --distance org.apache.mahout.utils.CosineDistanceMeasure

I get the two directories seen in n2-output. The clusters-0 and clusters-1 files both contain a single vector which is all 0.

I've also tried SquaredEuclidean, but to no avail.
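
For reference, on L2-normalized input the two measures are closely related: the squared Euclidean distance between unit vectors equals twice the cosine distance (taking cosine distance as 1 - cos), so switching between them would not be expected to change much. The sketch below is plain Java for a quick sanity check, not Mahout code; the class and method names are hypothetical, and it makes no claim about how CosineDistanceMeasure treats a zero vector internally.

```java
final class NormalizedDistanceCheck {

    // Scale v to unit Euclidean (L2) length. An all-zero input yields an all-zero output.
    static double[] l2Normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // Inner product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // 1 - cos(angle between a and b). In this sketch the result is NaN
    // when either vector is all zeros, because the division is 0/0.
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Sum of squared component differences.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = l2Normalize(new double[] {3, 4, 0});
        double[] b = l2Normalize(new double[] {1, 0, 1});
        System.out.println("cosine distance:     " + cosineDistance(a, b));
        System.out.println("squared Euclidean:   " + squaredEuclidean(a, b));
        System.out.println("cosine vs all-zero:  " + cosineDistance(new double[3], a));
    }
}
```

The last line of main shows why an all-zero centroid is worth checking for: in this sketch, a cosine distance against it is undefined rather than merely large.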

Any insight into what I'm doing wrong would be appreciated.

Thanks,
Grant







--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
