RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Palleti, Pallavi Wed, 18 Mar 2009 21:55:00 -0700

There is a test case in TestKMeansClustering.java that uses the output of 
Canopy as its input, and it passed without any issue. However, it uses the 
local file system rather than HDFS, which might be why it succeeded.


Thanks
Pallavi



-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:14 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The unit tests don't care which format is used as long as it is consistent. The 
compiler helps enforce that. kMeans will run and its tests will pass. So will 
Canopy. When somebody runs the kMeans example it encounters the file format 
differences. Are all the examples run by the install? I'd be surprised.

Jeff


Palleti, Pallavi wrote:
> Yeah. But I am wondering how the test cases succeeded. I ran them using the 
> "mvn clean install" command.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Jeff Eastman [mailto:j...@windwardsolutions.com]
> Sent: Thursday, March 19, 2009 9:56 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
>
> The Synthetic Control kMeans job calls the Canopy job to build its initial 
> clusters as is commonly done. If the kMeans record format was changed and the 
> Canopy not changed accordingly, then everything would still compile but there 
> would be a mismatch when the kMeans mapper tried to read in the clusters.
>
> Jeff
>
>
> Richard Tomsett (JIRA) wrote:
>   
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>>
>> Richard Tomsett commented on MAHOUT-99:
>> ---------------------------------------
>>
>> Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1 and I get 
>> the same error on the Synthetic Control example. It seems to be because the 
>> new KMeans code uses a KeyValueLineRecordReader object to read the input 
>> cluster centres from the canopy clustering output, but the canopy clustering 
>> job outputs a SequenceFile (and the old KMeans code read in a SequenceFile 
>> for the cluster centres). Think that's the problem at least, I'll have a 
>> quick play.
>>
>>   
>>     
>>> Improving speed of KMeans
>>> -------------------------
>>>
>>>                 Key: MAHOUT-99
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Clustering
>>>            Reporter: Pallavi Palleti
>>>            Assignee: Grant Ingersoll
>>>             Fix For: 0.1
>>>
>>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
>>> MAHOUT-99.patch
>>>
>>>
>>> Improved the speed of KMeans by passing only the cluster ID from mapper to 
>>> reducer. Previously, the whole Cluster info was being sent as a formatted 
>>> string. Also removed the implicit assumption that the Combiner runs exactly 
>>> once; the code is modified so that it won't create a bug when the combiner 
>>> runs zero times or more than once.
>>>     
>>>       
>>   
>>     
>
>

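For illustration only, the mismatch discussed in this thread can be sketched in plain Java without Hadoop. This is a minimal sketch under stated assumptions, not Mahout code: the class and method names are hypothetical, the binary writer merely stands in for a job that emits a binary format such as a SequenceFile, and the text reader merely stands in for a reader that expects key<TAB>value lines. Both sides exchange plain bytes, so everything compiles, and a test that writes and reads the same format passes; the failure only shows up when the two are combined at run time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from Mahout or Hadoop.
class FormatMismatchSketch {

    // Stand-in for a job that writes clusters in a binary record format.
    static byte[] writeBinaryCluster(String id, double[] center) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(id);
            out.writeInt(center.length);
            for (double v : center) {
                out.writeDouble(v);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a writer that emits one "id<TAB>v1,v2,..." text line.
    static byte[] writeTextCluster(String id, double[] center) {
        StringBuilder line = new StringBuilder(id).append('\t');
        for (int i = 0; i < center.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(center[i]);
        }
        return line.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for a reader that expects a key<TAB>value text line.
    static double[] readTextCluster(byte[] data) {
        String line = new String(data, StandardCharsets.UTF_8);
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalStateException("expected a key<TAB>value text record");
        }
        String[] parts = line.substring(tab + 1).split(",");
        double[] center = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            center[i] = Double.parseDouble(parts[i]);
        }
        return center;
    }

    // Returns true if the text reader can parse the given bytes.
    static boolean canRead(byte[] data) {
        try {
            readTextCluster(data);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        double[] center = {1.0, 2.0};
        // Same format on both sides: works, like a self-consistent unit test.
        System.out.println("text -> text reader:   " + canRead(writeTextCluster("C0", center)));
        // Binary writer feeding the text reader: compiles, fails at run time.
        System.out.println("binary -> text reader: " + canRead(writeBinaryCluster("C0", center)));
    }
}
```

Running main reports that the matching pair can be read and the mixed pair cannot, which mirrors why per-algorithm unit tests could pass while an example that chains the two jobs fails.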