I looked at the Reuters example in MiA, and it has not yet been updated to reflect the recent changes in file nomenclature in trunk. It was actually incorrect for 0.3 too: it shows the contents of reuters-vectors after seq2sparse (on p. 132) as:

$ls reuters-vectors/
dictionary.file-0
tfidf/
tokenized-documents/
vectors/
wordcount/

but then (on p. 144) it gives the input argument to k-means as:

-i reuters-vectors

which should have been:

-i reuters-vectors/tfidf (and maybe also /vectors after that; IIRC, it's been a few months since it was changed)
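For reference, a corrected form of the p. 144 command under the 0.3 layout would look something like the sketch below; the -c and -o flags are taken from the command later in this thread, and the input path carries the same caveat as above (it may need /vectors appended):

```shell
# Hypothetical corrected invocation for the 0.3 layout described in MiA;
# the input path may need to be reuters-vectors/tfidf/vectors instead.
bin/mahout kmeans \
  -i reuters-vectors/tfidf \
  -c reuters-initial-clusters \
  -o reuters-kmeans-clusters
```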

As noted below, the current nomenclature after seq2sparse is:

ls reuters-out-seqdir-sparse/
df-count/
dictionary.file-0
frequency.file-0
tf-vectors/
tfidf-vectors/
tokenized-documents/
wordcount/

We will need to get the book examples and the code in sync with whichever release coincides with its final publication; both are moving targets right now. Given Mahout's rate of change, we always recommend using trunk, and the trunk examples are the most likely to work.

On 10/5/10 6:24 PM, Jeff Eastman wrote:
The random seed generator can't read the parts in the input folder "reuters-vectors". What is in that directory? The program is expecting part files containing VectorWritable points. If you ran examples/bin/build-reuters.sh then the input to k-means (see the script) should be:

-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/

I suggest running the script with the k-means clustering uncommented before stepping outside the standard file nomenclature.
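Putting Jeff's input path together with the flags from the original command, a corrected invocation against the current trunk layout should look roughly like this (the path assumes the examples/bin/build-reuters.sh work directory):

```shell
# Sketch of the corrected k-means command for trunk, assuming the
# build-reuters.sh work directory layout; adjust the path if you ran
# seq2sparse elsewhere.
bin/mahout kmeans \
  -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
  -c reuters-initial-clusters \
  -o reuters-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -k 20 -x 10
```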
Jeff


On 10/5/10 4:17 PM, Chris Bush wrote:
When I try k-means clustering on the Reuters example data (the Reuters-21578 news collection) as covered in Mahout in Action, the following stack trace occurs immediately (with and without HADOOP_HOME set; with it set, the no-HADOOP_HOME warning is omitted):

$ bin/mahout kmeans -i reuters-vectors -c reuters-initial-clusters \
    -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -r 1 -cd 1.0 -k 20 -x 10

no HADOOP_HOME set, running locally
Oct 5, 2010 2:27:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=reuters-vectors, --maxIter=10, --maxRed=1, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
Oct 5, 2010 2:27:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting reuters-initial-clusters
Oct 5, 2010 2:27:29 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Oct 5, 2010 2:27:29 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.ClassCastException: class org.apache.hadoop.io.IntWritable
    at java.lang.Class.asSubclass(Class.java:3018)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:86)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:139)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
$

The org.apache.mahout.clustering.kmeans.RandomSeedGenerator class successfully casts the key read from the SequenceFile.Reader to org.apache.hadoop.io.Writable, but then fails when it tries to cast the value class to org.apache.mahout.math.VectorWritable.

Thanks,

Chris
