It depends on the kind of output. If we are writing only numeric values, 
a SequenceFile is preferable because the data is written in binary. 
Otherwise, it is preferable to write plain text. A text file is 
human-readable, whereas a binary file is not. 

Since we treat the data as text in the reducers of both Canopy and KMeans, 
I don't see any performance improvement from using SequenceFile. So I used 
TextInputFormat, which keeps the data human-readable.
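
To illustrate the size side of this trade-off, here is a minimal sketch in 
plain Java. It does not use Hadoop's SequenceFile or TextInputFormat at all; 
the class EncodingSizeDemo and its two methods are hypothetical names made up 
for this example. It only compares how many bytes the same numeric values 
take as decimal text lines versus raw 8-byte doubles.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Mahout or Hadoop.
public class EncodingSizeDemo {

    // Bytes needed when each value is written as one line of decimal text.
    static int textSize(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append(Double.toString(v)).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed when each value is written as a raw 8-byte double.
    static int binarySize(double[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        double[] values = {0.123456789012345, 1234.56789012345, 3.0};
        System.out.println("text bytes:   " + textSize(values));
        System.out.println("binary bytes: " + binarySize(values));
    }
}
```

Binary always costs 8 bytes per value here, while text varies with the number 
of digits, so short values can actually be smaller as text. This says nothing 
about parsing cost, which is a separate question.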
 
Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also why not consider just converting canopy? Which reader is better?


Jeff Eastman wrote:
> * PGP Signed: 03/18/09 at 21:37:36
>
> Sure, why don't you go ahead and post a patch?
>
>
> Pallavi Palleti (JIRA) wrote:
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>> Pallavi Palleti commented on MAHOUT-99:
>> ---------------------------------------
>>
>> I used KeyValueLineRecordReader internally for my code and forgot 
>> to revert to SequenceFileReader. Would it be sufficient to add 
>> another patch on the latest code that modifies only KMeansDriver 
>> to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
>>
>>  
>>> Improving speed of KMeans
>>> -------------------------
>>>
>>>                 Key: MAHOUT-99
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Clustering
>>>            Reporter: Pallavi Palleti
>>>            Assignee: Grant Ingersoll
>>>             Fix For: 0.1
>>>
>>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
>>> MAHOUT-99.patch
>>>
>>>
>>> Improved the speed of KMeans by passing only the cluster ID from 
>>> mapper to reducer. Previously, the whole cluster info was being sent 
>>> as a formatted string.
>>> Also removed the implicit assumption that the combiner runs exactly 
>>> once; the code is modified so that it behaves correctly when the 
>>> combiner runs zero times or more than once.
>>>     
>>
>>   
>
>
> * Jeff Eastman <j...@windwardsolutions.com>
> * 0x6BFF1277
>
> .
>
