Hi Jeff,
I first converted a set of text files into sequence files with the small
custom program below, which calls Mahout's SequenceFilesFromDirectory
utility:
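import com.google.common.base.Charsets; // assuming Guava's Charsets
import org.apache.mahout.text.SequenceFilesFromDirectory;

public class TestSequenceFileConverter {
  public static void main(String[] args) {
    String inputDir = "testdataset";
    String outputDir = "sequenceInputDir";
    try {
      // Convert the text files under inputDir into sequence files in outputDir.
      SequenceFilesFromDirectory.main(new String[] {
          "--input", inputDir, "--output", outputDir,
          "--chunkSize", "64", "--charset", Charsets.UTF_8.name()});
    } catch (Exception e) {
      // Report the failure instead of printing an empty line.
      e.printStackTrace();
    }
  }
}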
Then I ran the k-means program, borrowed from NewsKMeansClustering (an
example program from Mahout in Action), against the generated sequence
files.
I just checked the generated clusters-0 directory; it contains a file called
part-r-00000. How can I read this file and get useful information out of
it? Thanks.
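In case it helps frame the question: as far as I understand, a sequence file
starts with a small header that names its key and value classes, which is what
a reader needs to know before it can decode part-r-00000. Below is a minimal
sketch that peeks at that header using only the JDK. The class SeqHeaderPeek
and its helper methods are hypothetical names of mine, and the layout (the
bytes "SEQ", one version byte, then two length-prefixed UTF-8 class names,
each shorter than 128 bytes) is my assumption about the Hadoop format, not
something I took from the Mahout documentation.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads only the start of a SequenceFile header.
final class SeqHeaderPeek {

  /** Returns {keyClassName, valueClassName} as recorded in the header. */
  static String[] readClassNames(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    byte[] magic = new byte[3];
    data.readFully(magic);
    if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
      throw new IOException("not a SequenceFile: missing SEQ magic bytes");
    }
    int version = data.readUnsignedByte();
    // Assumption: versions below 4 store the class names differently.
    if (version < 4) {
      throw new IOException("header version " + version + " is not handled");
    }
    String keyClass = readShortString(data);
    String valueClass = readShortString(data);
    return new String[] {keyClass, valueClass};
  }

  // Assumption: the length prefix fits in one byte (name under 128 bytes).
  private static String readShortString(DataInputStream data) throws IOException {
    int length = data.readByte();
    if (length < 0) {
      throw new IOException("multi-byte length prefix is not handled");
    }
    byte[] bytes = new byte[length];
    data.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // args[0]: path to a local copy of the sequence file.
    try (InputStream in = new FileInputStream(args[0])) {
      String[] names = readClassNames(in);
      System.out.println("key class:   " + names[0]);
      System.out.println("value class: " + names[1]);
    }
  }
}
```

If this is right, I could run it on a local copy of part-r-00000 and then pass
instances of the two reported classes to SequenceFile.Reader.next(), the same
way the listing below does for the clustered points.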
The NewsKMeansClustering program is listed here for your reference:
public class NewsKMeansClustering {
  public static void main(String[] args) throws Exception {
    int minSupport = 5;
    int minDf = 5;
    int maxDFPercent = 95;
    int maxNGramSize = 2;
    int minLLRValue = 50;
    int reduceTasks = 1;
    int chunkSize = 200;
    int norm = 2;
    boolean sequentialAccessOutput = true;

    // String inputDir = "inputDir";
    String inputDir = "sequenceInputDir";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    /*
     * SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
     *     new Path(inputDir, "documents.seq"), Text.class, Text.class);
     * for (Document d : Database) {
     *   writer.append(new Text(d.getID()), new Text(d.contents()));
     * }
     * writer.close();
     */
    String outputDir = "newsClusters";
    HadoopUtil.delete(conf, new Path(outputDir));
    Path tokenizedPath = new Path(outputDir,
        DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    MyAnalyzer analyzer = new MyAnalyzer();
    DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue, 2,
        true, reduceTasks, chunkSize, sequentialAccessOutput, false);
    TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSize, minDf,
        maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);

    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");
    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder,
        new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path(clusterOutput + "/" + Cluster.CLUSTERED_POINTS_DIR
            + "/part-m-00000"), conf);
    // new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"),conf);
    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(key.toString() + " belongs to cluster "
          + value.toString());
    }
    reader.close();
  }
}
On Wed, Aug 10, 2011 at 11:40 AM, Jeff Eastman <[email protected]> wrote:
> What do your input vectors look like?
> How many canopies did you get in clusters-0?
>
> -----Original Message-----
> From: eric skinner [mailto:[email protected]]
> Sent: Wednesday, August 10, 2011 8:33 AM
> To: [email protected]
> Subject: issues on Mahout clustering result using K-means
>
> I ran the K-means clustering algorithm against a set of sequence files.
> However, the generated result looks like this:
>
> 0 belongs to cluster 1.0: []
>
> 0 belongs to cluster 1.0: []
>
> 0 belongs to cluster 1.0: []
>
> 0 belongs to cluster 1.0: []
>
> 0 belongs to cluster 1.0: []
>
> 0 belongs to cluster 1.0: []
>
> Could you let me know why I get this type of result? Is it because of a
> specific parameter setting requirement, or something else?
>
> The program I use is borrowed from NewsKMeansClustering.java, an example
> given in chapter 9 of Mahout-in-Action.
>
> The core clustering code in this program is
>
> CanopyDriver.run(vectorsFolder, canopyCentroids, new
> EuclideanDistanceMeasure(), 250, 120, false, false);
>
> KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids,
> "clusters-0"),
> clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
>