[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574 ]
Rohini Uppuluri commented on MAHOUT-153:
----------------------------------------

Hi,

Please find a brief description of the input and output formats below. Hope this helps.

------------------------------------------------------------------------------
Input Format:
documentId\tdocument vector

Example line:
338 [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0, 197:5.0, 196:2.0, 286:4.0, 135:5.0, 86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0, 654:5.0, 443:5.0, 990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0, 708:5.0, 497:3.0, 498:4.0, 523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0, ]

Output Format:
ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line:
C0 C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0, 276:3.0, 7:5.0, 283:3.0, 282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0, 284:3.0, 258:4.0, 17:3.0, 257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0, 307:4.0, 39:5.0, 311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0, 532:4.0, 291:5.0, 292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0, 302:4.0, 343:4.0, 882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0, 345:1.0, 344:1.0, 326:4.0, 327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0, 328:4.0, 636:4.0, 100:5.0, 98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0, 127:5.0, 1073:4.0, 603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0, 696:4.0, 1244:4.0, 147:5.0, 690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0, 164:5.0, 642:4.0, 185:5.0, 431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0, 659:4.0, 1016:4.0, 477:1.0, 751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0, 466:5.0, 742:5.0, 1013:3.0, 1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0, 235:3.0, 504:5.0, 717:4.0, 234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0, 252:3.0, 250:5.0, 248:4.0, 245:4.0, ]

> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
> Key: MAHOUT-153
> URL: https://issues.apache.org/jira/browse/MAHOUT-153
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.2
> Environment: OS Independent
> Reporter: Panagiotis Papadimitriou
> Assignee: Ted Dunning
> Fix For: 0.4
>
> Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for
> initial cluster selection (seed selection): 1) random selection of k points,
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will
> become a subclass of SeedGenerator.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
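
For readers following the thread, here is a minimal Python sketch of how lines in the input format above could be read and how k-means++ might choose seeds from them. It is not taken from either attached patch. The function names (parse_line, squared_distance, kmeans_pp_seeds) are hypothetical, and reading the leading s1682 token as the vector cardinality is an assumption based on the example line, not something the comment states.

```python
import random


def parse_line(line):
    """Parse 'documentId<TAB>[s<size>, index:value, ...]' into (doc_id, size, vector)."""
    doc_id, vector_text = line.rstrip("\n").split("\t", 1)
    fields = [f.strip() for f in vector_text.strip().strip("[]").split(",")]
    fields = [f for f in fields if f]  # drop the empty field left by the trailing comma
    # Assumption: the leading 's1682' token is the vector cardinality.
    size = int(fields[0][1:])
    vector = {}
    for field in fields[1:]:
        index, value = field.split(":")
        vector[int(index)] = float(value)
    return doc_id, size, vector


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in set(a) | set(b))


def kmeans_pp_seeds(vectors, k, rng=None):
    """Pick k seeds: the first uniformly at random, each later one with
    probability proportional to its squared distance to the nearest seed."""
    rng = rng or random.Random()
    seeds = [rng.choice(vectors)]
    # nearest[i] is the squared distance from vectors[i] to its closest seed so far
    nearest = [squared_distance(v, seeds[0]) for v in vectors]
    while len(seeds) < k:
        total = sum(nearest)
        if total == 0.0:
            # Every point coincides with a seed; fall back to a uniform choice.
            seeds.append(rng.choice(vectors))
            continue
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = len(vectors) - 1  # guard against floating-point rounding
        for i, d in enumerate(nearest):
            cumulative += d
            if cumulative > threshold:
                chosen = i
                break
        seeds.append(vectors[chosen])
        nearest = [min(d, squared_distance(v, vectors[chosen]))
                   for v, d in zip(vectors, nearest)]
    return seeds
```

To try it, parse each line of the input file with parse_line, collect the third element of each result into a list, and pass that list to kmeans_pp_seeds along with the desired k. This sketch runs in memory on one machine, so it only illustrates the selection rule and says nothing about how a map/reduce version would be structured.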