[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574
 ] 

Rohini Uppuluri commented on MAHOUT-153:
----------------------------------------


Hi,

Below is a brief description of the input and output formats.
I hope this helps.

------------------------------------------------------------------------------

Input Format:
documentId\tdocument vector


Example line (a single line in the file; wrapped here by email):

338          [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0,
197:5.0, 196:2.0, 286:4.0, 135:5.0,
86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0,
654:5.0, 443:5.0,
990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0,
708:5.0, 497:3.0, 498:4.0,
523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0,
]
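
For anyone who wants to read such input lines outside Mahout, here is a minimal sketch in Python. It assumes that the leading "s1682" field marks a sparse vector with 1682 dimensions and that the remaining fields are index:value pairs; that reading is inferred from the example, not confirmed above. The function names are hypothetical and are not part of the patch.

```python
def parse_vector(text):
    """Parse '[s1682, 275:5.0, 478:3.0, ]' into (cardinality, {index: value})."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("vector must be enclosed in [ ]")
    # Drop the brackets, split on commas, and ignore empty fields such as
    # the one left by the trailing ", " before the closing bracket.
    fields = [f.strip() for f in body[1:-1].split(",") if f.strip()]
    if not fields or not fields[0].startswith("s"):
        raise ValueError("expected a leading 's<cardinality>' field")
    # Assumption: 's1682' means a sparse vector with 1682 dimensions.
    cardinality = int(fields[0][1:])
    entries = {}
    for field in fields[1:]:
        index, value = field.split(":")
        entries[int(index)] = float(value)
    return cardinality, entries


def parse_input_line(line):
    """Split 'documentId<TAB>vector' and parse the vector part."""
    doc_id, vector_text = line.rstrip("\n").split("\t", 1)
    return doc_id.strip(), parse_vector(vector_text)
```

Calling parse_input_line("338\t[s1682, 275:5.0, 478:3.0, ]") should give the document ID "338" together with the cardinality and a dictionary of the non-zero entries. The line has to be unwrapped first, since the wrapping above comes from the mail client.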


Output Format:
ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line (a single line in the file; wrapped here by email):
C0      C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0,
276:3.0, 7:5.0, 283:3.0,
282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0,
284:3.0, 258:4.0, 17:3.0,
257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0,
307:4.0, 39:5.0,
311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0,
532:4.0, 291:5.0,
292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0,
302:4.0, 343:4.0,
882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0,
345:1.0, 344:1.0, 326:4.0,
327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0,
328:4.0, 636:4.0, 100:5.0,
98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0,
127:5.0, 1073:4.0,
603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0,
696:4.0, 1244:4.0, 147:5.0,
690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0,
164:5.0, 642:4.0, 185:5.0,
431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0,
659:4.0, 1016:4.0, 477:1.0,
751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0,
466:5.0, 742:5.0, 1013:3.0,
1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0,
235:3.0, 504:5.0, 717:4.0,
234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0,
252:3.0, 250:5.0, 248:4.0,
245:4.0, ]
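
An output line can be read back in a similar way. The sketch below is in Python and assumes the layout shown above: the cluster identifier as the key, a tab, then the same identifier followed by a colon and the center vector. It also assumes that "s1682" gives the vector's dimension count. The function name is hypothetical.

```python
def parse_cluster_line(line):
    """Parse 'C0<TAB>C0: [s1682, 275:3.0, ]' into (cluster_id, cardinality, center)."""
    key, value = line.rstrip("\n").split("\t", 1)
    # The value repeats the identifier before the first colon.
    label, vector_text = value.split(":", 1)
    cluster_id = key.strip()
    if label.strip() != cluster_id:
        raise ValueError("key and value label disagree: %r vs %r" % (key, label))
    # Remove the brackets and skip the empty field left by the trailing comma.
    fields = [f.strip() for f in vector_text.strip().strip("[]").split(",")]
    fields = [f for f in fields if f]
    # Assumption: the first field, e.g. 's1682', carries the dimension count.
    cardinality = int(fields[0].lstrip("s"))
    center = {}
    for field in fields[1:]:
        index, weight = field.split(":")
        center[int(index)] = float(weight)
    return cluster_id, cardinality, center
```

The check that the key and the label agree is only a sanity test on the line; it can be dropped if the two are ever expected to differ.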





> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.
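
For reference, the seeding step that the issue asks for can be sketched roughly as follows. This is a plain Python illustration of k-means++ as described in the linked paper, not the code in the attached patches: it uses dense tuples instead of Mahout vectors, runs on a single machine, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_seeds(points, k, rng=None):
    """Pick k initial centers; each new one is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = rng or random.Random()
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # The first center is chosen uniformly at random.
    seeds = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest seed so far.
    d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            candidate = rng.choice(points)
        else:
            candidate = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(candidate)
        # Only the new seed can shrink a point's nearest-seed distance.
        d2 = [min(d, squared_distance(p, candidate)) for d, p in zip(d2, points)]
    return seeds
```

Points that are already seeds have weight zero, so they are not drawn again unless all weights are zero. Passing a seeded random.Random instance as rng makes a run repeatable.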

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
