Error with KMeans example in trunk (793689)

Paul Ingles Mon, 13 Jul 2009 18:40:18 -0700

Hi,

I've been going over the kmeans stuff the last few days to try and understand how it works, and how I might extend it to work with the data I'm looking to process. It's taken me a while to get a basic understanding of things, and I really appreciate having lists like this around for support.

I need to be able to label the vectors: each vector holds (for a document) a set of similarity scores across a number of attributes. I did some searching around payloads (after coming across the term in some comments) but couldn't see how to add a payload to the Vector. I then stumbled on MAHOUT-65 (https://issues.apache.org/jira/browse/MAHOUT-65), which mentions the addition of the setName method to Vector. I've tried building trunk, and although there were a few test failures for other (seemingly unrelated) examples, I continued and managed to get the mahout-examples jar/job files built to give it a whirl.


When I run the following:

$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

I see it run "Preparing Input" and "Running Canopy to get initial clusters", and then finally it starts "Running KMeans". But shortly after, it breaks with the following trace:


---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 maxIterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient:  map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient:  map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient:  map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient:  map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10

09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---

This is against revision 793689, running on my development Mac Pro (pseudo-distributed single node) with Hadoop 0.19.1.
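
Going only by the trace, the failure seems to come from KMeansDriver.isConverged opening a SequenceFile reader on /user/paul/output/clusters-0/_logs, which looks like Hadoop's job-log entry rather than a cluster part file. If that reading is right, one possible workaround is to skip entries whose names start with "_" or "." before opening them. The following is only a minimal sketch of that idea in plain Java: ClusterOutputFilter, isDataFile and dataFiles are hypothetical names, not Mahout or Hadoop APIs, and part-00000 is assumed as the reducer output name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of Mahout or Hadoop: decides which entries
// of a job output directory look like data files worth opening.
public class ClusterOutputFilter {

    // True when the last path component does not start with '_' or '.'.
    // Assumption: entries such as _logs are bookkeeping, not cluster data.
    public static boolean isDataFile(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return !name.isEmpty() && !name.startsWith("_") && !name.startsWith(".");
    }

    // Keeps only the entries that isDataFile accepts, preserving order.
    public static List<String> dataFiles(List<String> paths) {
        List<String> result = new ArrayList<String>();
        for (String p : paths) {
            if (isDataFile(p)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Directory listing assumed for illustration; only _logs is from the trace.
        List<String> listing = Arrays.asList(
                "/user/paul/output/clusters-0/_logs",
                "/user/paul/output/clusters-0/part-00000");
        System.out.println(dataFiles(listing));
    }
}
```

In a real fix the same check would presumably be applied to the file listing inside isConverged before each reader is opened, but I have not confirmed how that method enumerates the directory.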

It's a bit late to be digging through what's going on, but I will try and take a look tomorrow. I'm really excited about giving kmeans a whirl on the document processing I'm playing with. In the meantime, I was wondering whether anyone else had seen the same, or knew a way to accomplish something similar with the released version (or could point me to a past good revision, perhaps?).


Thanks again,
Paul
