[
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robin Anil updated MAHOUT-1084:
-------------------------------
Assignee: Robin Anil
> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -----------------------------------------------------------------------------
>
> Key: MAHOUT-1084
> URL: https://issues.apache.org/jira/browse/MAHOUT-1084
> Project: Mahout
> Issue Type: Bug
> Reporter: liutengfei
> Assignee: Robin Anil
> Fix For: 0.8
>
>
> In Mahout-Kmeans for syntheticcontrol example, using the default
> parameters means to compute 6 clusters at last. But why there are 12 clusters
> during Kmeans iterations. According to my observation, the former 6 clusters
> and the latter 6 clusters are the same before the first iteration,those 6
> clusters are generatored by RandomSeedGenerator.java. Then the CIMapper will
> assign its own points to this 12 clusters. Is here existing logical errors?
> The 12 clusters are created by the function "setup" in CIMapper.java,
> more specifically, is the line "classifier.readFromSeqFiles(conf, new
> Path(priorClustersPath));", here the "priorClustersPath" means hdfs direction
> "output/clusters-0/", there are 8 files in this direction:
> "_policy","part-randomSeed"(one file record six cluster),"part-00000" to
> "part-00005"(total six files,every one record a cluster), while reading this
> direction, "_policy" will be filtered out, so program will read "part-00000"
> to "part-00005" to create six clusters, then read "part-randomSeed" to create
> the other six clusters, this is the reason why there will be 12 clusters
> before first iteration.
> Solution: delete associated code to avoid duplicately creating clusters
> in "output/clusters-0/", here i delete codes where create files: "part-00000"
> to "part-00005" in ClusterClassfier.java:
> public void writeToSeqFiles(Path path) throws IOException {
> writePolicy(policy, path);
> /*
> Configuration config = new Configuration();
> FileSystem fs = FileSystem.get(path.toUri(), config);
> SequenceFile.Writer writer = null;
> ClusterWritable cw = new ClusterWritable();
> for (int i = 0; i < models.size(); i++) {
> try {
> Cluster cluster = models.get(i);
> cw.setValue(cluster);
> writer = new SequenceFile.Writer(fs, config,
> new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d",
> i)), IntWritable.class,
> ClusterWritable.class);
> Writable key = new IntWritable(i);
> writer.append(key, cw);
> } finally {
> Closeables.closeQuietly(writer);
> }
> }
> */
> }
> I don't know if it is still okay for other progams who using this file,
> but for KMeans in Syntheticcontrol example, program will create 6 clusters
> during every iterations as i expected.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira