It depends on the kind of output. If we are outputting only numeric values, a SequenceFile is preferable because the data is written in binary. Otherwise, plain text is preferable. A text file is human-readable, whereas a binary file is not.
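To make the binary-versus-text trade-off concrete, here is a minimal sketch that encodes the same numeric values both ways and compares the byte counts. It uses only the Java standard library, not Hadoop: the class name BinaryVsText and the sample values are hypothetical, and a real SequenceFile adds its own header, sync markers, and per-record framing on top of the raw values, so treat the numbers as an illustration rather than a measurement of Mahout's output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryVsText {
    // Encode each value as a raw 8-byte IEEE 754 double (fixed width).
    static byte[] toBinary(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES * values.length);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Encode the same values as one comma-separated, newline-terminated text line.
    static byte[] toText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        sb.append('\n');
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical point with many significant digits per coordinate.
        double[] point = {0.123456789012, 1234.56789012345, -9.87654321098765};
        System.out.println("binary bytes: " + toBinary(point).length);
        System.out.println("text bytes:   " + toText(point).length);
        System.out.print("text form:    " + new String(toText(point), StandardCharsets.UTF_8));
    }
}
```

The binary form always costs 8 bytes per value, while the text form grows with the number of digits printed but can be inspected directly. Values with few digits (for example 1.0) are actually smaller as text, so the size advantage of binary depends on the data.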
Since the reducers of both Canopy and KMeans treat the data as text, I don't see any performance improvement from using SequenceFile. So I used TextInputFormat, which is human-readable.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also why not consider just converting canopy? Which reader is better?

Jeff Eastman wrote:
> * PGP Signed: 03/18/09 at 21:37:36
>
> Sure, why don't you go ahead and post a patch?
>
>
> Pallavi Palleti (JIRA) wrote:
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>>
>> Pallavi Palleti commented on MAHOUT-99:
>> ---------------------------------------
>>
>> I have used KeyValueLineRecordReader internally for my code and
>> forgot to revert back to SequenceFileReader. Will that be sufficient
>> to add another patch on the latest code and modify only KMeansDriver
>> to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
>>
>>
>>> Improving speed of KMeans
>>> -------------------------
>>>
>>>                 Key: MAHOUT-99
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Clustering
>>>            Reporter: Pallavi Palleti
>>>            Assignee: Grant Ingersoll
>>>             Fix For: 0.1
>>>
>>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch,
>>>                      MAHOUT-99.patch
>>>
>>>
>>> Improved the speed of KMeans by passing only cluster ID from mapper
>>> to reducer. Previously, whole Cluster Info as formatted string was
>>> being sent.
>>> Also removed the implicit assumption of Combiner runs only once
>>> approach and the code is modified accordingly so that it won't
>>> create a bug when combiner runs zero or more than once.
>>>
>>
>
> * Jeff Eastman <j...@windwardsolutions.com>
> * 0x6BFF1277