Hi,

I've been going over the kmeans code for the last few days to try to understand how it works, and how I might extend it to work with the data I'm looking to process. It's taken me a while to get a basic understanding of things, and I really appreciate having lists like this around for support.

I need to be able to label the vectors: each vector holds (for a document) a set of similarity scores across a number of attributes. I did some searching around payloads (after coming across the term in some comments) but couldn't see how to add a payload to a Vector. I then stumbled on MAHOUT-65 (https://issues.apache.org/jira/browse/MAHOUT-65), which mentions the addition of the setName method to Vector. I've tried building trunk, and although there were a few test failures for other (seemingly unrelated) examples, I continued and managed to get the mahout-examples jar/job files built to give it a whirl.

When I run the following:

$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

I see it run "Preparing Input" and "Running Canopy to get initial clusters", and then finally it starts "Running KMeans". Shortly after that, it breaks with the following trace:

---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient:  map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient:  map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient:  map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient:  map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10
09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---

This is against revision 793689, running on my development Mac Pro (pseudo-distributed single node) with Hadoop 0.19.1.
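
From a quick read of the trace, KMeansDriver.isConverged seems to be opening every entry under output/clusters-0 as a SequenceFile, including the _logs directory that Hadoop writes into the job output. If that guess is right, skipping such entries by name might be enough. Here is a rough, self-contained sketch of the idea in plain Java. The class and method names are made up for illustration, the listing is an invented example, and none of this is actual Mahout or Hadoop code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClusterFileFilter {

    // Hypothetical helper: true if the entry name looks like job output data
    // rather than bookkeeping such as "_logs" or a hidden ".crc" file.
    static boolean isDataFile(String name) {
        return !name.isEmpty() && !name.startsWith("_") && !name.startsWith(".");
    }

    // Keep only the entries that should be opened as SequenceFiles.
    static List<String> dataFiles(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isDataFile(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented directory listing, for illustration only.
        List<String> listing = Arrays.asList("_logs", "part-00000", ".part-00000.crc");
        System.out.println(dataFiles(listing)); // prints [part-00000]
    }
}
```

In the driver itself, the same check would presumably be applied to the paths returned when listing the clusters directory, before any reader is opened on them.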

It's a bit late to dig into this properly, but I'll try to take a look tomorrow. I'm really excited about giving kmeans a whirl on the document processing I'm playing with. In the meantime, I was wondering whether anyone else has seen the same thing, knows a way to accomplish something similar with the released version, or could perhaps point me to a known-good past revision?

Thanks again,
Paul
