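It looks to me like all of your data points are empty sparse vectors: each "1.0: []" line is a point with weight 1.0 and no elements, and the cluster centers (c=[]) are empty too. Check your input vectors for nonzero values :)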
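
A rough sketch of the kind of check I mean, in plain Java so it runs without Hadoop or Mahout on the classpath. The Map<Integer, Double> is only a hypothetical stand-in for a sparse vector (index -> value), and the class and method names are made up for illustration. Against the real data I would expect the equivalent to be reading the tfidf-vectors part files with SequenceFile.Reader and VectorWritable and looking at getNumNondefaultElements() on each vector, but treat that as an assumption to verify against your Mahout version.

```java
import java.util.List;
import java.util.Map;

public class EmptyVectorCheck {

    // Counts entries whose value is not zero. The map is a hypothetical
    // stand-in for a sparse vector: index -> value.
    static int nonZeroCount(Map<Integer, Double> vector) {
        int count = 0;
        for (double value : vector.values()) {
            if (value != 0.0) {
                count++;
            }
        }
        return count;
    }

    // Counts vectors that carry no nonzero value at all. These are the
    // ones that would show up as [] in a dump.
    static int countEmpty(List<Map<Integer, Double>> vectors) {
        int empty = 0;
        for (Map<Integer, Double> vector : vectors) {
            if (nonZeroCount(vector) == 0) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> vectors = List.of(
                Map.of(),                  // no entries at all
                Map.of(3, 0.0),            // one stored entry, but it is zero
                Map.of(1, 0.42, 7, 1.3));  // a usable vector
        System.out.println(countEmpty(vectors) + " of " + vectors.size()
                + " vectors have no nonzero values");
    }
}
```

If every vector comes back empty, the clustering step has nothing to work with, and the place to look is whatever produced the vectors rather than the K-means parameters.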

-----Original Message-----
From: surf reta [mailto:[email protected]]
Sent: Wednesday, August 10, 2011 2:05 PM
To: [email protected]
Subject: Re: issues on Mahout clustering result using K-means

Hi Jeff,

With respect to the clusterdump result for the K-means-generated clusters, I get something like

VL-0{n=100 c=[] r=[]}
Weight: Point:
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []

With respect to the clusterdump result for canopyCentroids/cluster-0, I get something like

C-0{n=1 c=[] r=[]}
Weight: Point:
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []
1.0: []

I am really confused about the physical meaning of these results. Thanks.

On Wed, Aug 10, 2011 at 12:31 PM, Jeff Eastman <[email protected]> wrote:

> Run clusterdump -s canopyCentroids/clusters-0. Generally, Mahout arguments
> are directories full of part-n files. You can also run clusterdump -s
> clusterOutput/clusters-n -p .../clusteredPoints after KMeans to see the
> results of your clustering. Argument 'n' would be the last iteration number.
>
> -----Original Message-----
> From: surf reta [mailto:[email protected]]
> Sent: Wednesday, August 10, 2011 9:19 AM
> To: [email protected]
> Subject: Re: issues on Mahout clustering result using K-means
>
> Hi Jeff,
>
> I first converted a set of text files into sequence files through a
> customized program as follows. This program uses the Mahout utility
> SequenceFilesFromDirectory.
>
> public class TestSequenceFileConverter {
>
>   public static void main(String args[]) {
>
>     String inputDir = "testdataset";
>     String outputDir = "sequenceInputDir";
>     try {
>       SequenceFilesFromDirectory.main(new String[] {"--input",
>           inputDir.toString(), "--output", outputDir.toString(),
>           "--chunkSize", "64", "--charset", Charsets.UTF_8.name()});
>     } catch (Exception e) {
>       System.out.println("");
>     }
>   }
> }
>
> Then I ran the K-means program, borrowed from NewsKMeansClustering, an
> example program given in Mahout-in-Action, against these generated
> sequence files.
>
> I just checked the generated clusters-0 directory; it has a file called
> part-r-00000. How can I read this file and get the useful information from
> it? Thanks.
>
> The NewsKMeansClustering program is listed here for your reference:
>
> public class NewsKMeansClustering {
>
>   public static void main(String args[]) throws Exception {
>
>     int minSupport = 5;
>     int minDf = 5;
>     int maxDFPercent = 95;
>     int maxNGramSize = 2;
>     int minLLRValue = 50;
>     int reduceTasks = 1;
>     int chunkSize = 200;
>     int norm = 2;
>     boolean sequentialAccessOutput = true;
>
>     // String inputDir = "inputDir";
>
>     String inputDir = "sequenceInputDir";
>
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     /*
>      * SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new
>      * Path(inputDir, "documents.seq"), Text.class, Text.class);
>      * for (Document d : Database) {
>      *   writer.append(new Text(d.getID()), new Text(d.contents()));
>      * }
>      * writer.close();
>      */
>
>     String outputDir = "newsClusters";
>     HadoopUtil.delete(conf, new Path(outputDir));
>     Path tokenizedPath = new Path(outputDir,
>         DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>     MyAnalyzer analyzer = new MyAnalyzer();
>     DocumentProcessor.tokenizeDocuments(new Path(inputDir),
>         analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>
>     DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>         new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue, 2,
>         true, reduceTasks, chunkSize, sequentialAccessOutput, false);
>     TFIDFConverter.processTfIdf(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, chunkSize, minDf,
>         maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);
>     Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
>     Path canopyCentroids = new Path(outputDir, "canopy-centroids");
>     Path clusterOutput = new Path(outputDir, "clusters");
>
>     CanopyDriver.run(vectorsFolder, canopyCentroids,
>         new EuclideanDistanceMeasure(), 250, 120, false, false);
>     KMeansDriver.run(conf, vectorsFolder,
>         new Path(canopyCentroids, "clusters-0"),
>         clusterOutput, new TanimotoDistanceMeasure(), 0.01,
>         20, true, false);
>
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs,
>         new Path(clusterOutput + "/" + Cluster.CLUSTERED_POINTS_DIR
>             + "/part-m-00000"), conf);
>     // new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"),conf);
>
>     IntWritable key = new IntWritable();
>     WeightedVectorWritable value = new WeightedVectorWritable();
>     while (reader.next(key, value)) {
>       System.out.println(key.toString() + " belongs to cluster "
>           + value.toString());
>     }
>     reader.close();
>   }
> }
>
> On Wed, Aug 10, 2011 at 11:40 AM, Jeff Eastman <[email protected]> wrote:
>
> > What do your input vectors look like?
> > How many canopies did you get in clusters-0?
> >
> > -----Original Message-----
> > From: eric skinner [mailto:[email protected]]
> > Sent: Wednesday, August 10, 2011 8:33 AM
> > To: [email protected]
> > Subject: issues on Mahout clustering result using K-means
> >
> > I ran the K-means clustering algorithm against a set of sequence files.
> > However, the generated result looks like this:
> >
> > 0 belongs to cluster 1.0: []
> > 0 belongs to cluster 1.0: []
> > 0 belongs to cluster 1.0: []
> > 0 belongs to cluster 1.0: []
> > 0 belongs to cluster 1.0: []
> > 0 belongs to cluster 1.0: []
> >
> > Could you let me know why I get this type of result? Is it because of a
> > specific parameter setting or something else?
> >
> > The program I use is borrowed from NewsKMeansClustering.java, an example
> > given in chapter 9 of Mahout-in-Action.
> >
> > The core clustering code in this program is
> >
> > CanopyDriver.run(vectorsFolder, canopyCentroids,
> >     new EuclideanDistanceMeasure(), 250, 120, false, false);
> >
> > KMeansDriver.run(conf, vectorsFolder,
> >     new Path(canopyCentroids, "clusters-0"),
> >     clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
