Jeff
Richard Tomsett (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]

Richard Tomsett commented on MAHOUT-99:
---------------------------------------

Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play.

Improving speed of KMeans
-------------------------

                Key: MAHOUT-99
                URL: https://issues.apache.org/jira/browse/MAHOUT-99
            Project: Mahout
         Issue Type: Improvement
         Components: Clustering
           Reporter: Pallavi Palleti
           Assignee: Grant Ingersoll
            Fix For: 0.1
        Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch

Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.
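
The combiner point in the issue description can be illustrated with a minimal sketch. This is not the code from the attached patches; the class name PartialSum and its methods are hypothetical, and plain Java stands in for the Hadoop mapper/combiner/reducer classes. The idea shown is that if each intermediate value is a (count, sum) pair and merging is associative and commutative, the final centroid is the same whether the combiner runs zero times, once, or repeatedly.

```java
import java.util.List;

/** Hypothetical partial aggregate for one cluster: point count plus per-dimension sum. */
final class PartialSum {
    final long count;
    final double[] sum;

    PartialSum(long count, double[] sum) {
        this.count = count;
        this.sum = sum;
    }

    /** What a mapper could emit for a single point assigned to a cluster. */
    static PartialSum ofPoint(double[] point) {
        return new PartialSum(1, point.clone());
    }

    /**
     * Merges partial aggregates. Input and output have the same type, so the
     * same logic can serve as combiner and as the first step of the reducer,
     * and applying it any number of times does not change the totals.
     */
    static PartialSum merge(List<PartialSum> parts) {
        long count = 0;
        double[] sum = new double[parts.get(0).sum.length];
        for (PartialSum p : parts) {
            count += p.count;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p.sum[i];
            }
        }
        return new PartialSum(count, sum);
    }

    /** Reducer-side step: new centroid = total sum / total count. */
    double[] centroid() {
        double[] c = new double[sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = sum[i] / count;
        }
        return c;
    }
}
```

For example, merging three points directly, or first merging two of them (as a combiner might) and then merging the result with the third, gives the same count and sum and therefore the same centroid. A design that averaged inside the combiner instead would only be correct if the combiner ran exactly once.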