[ https://issues.apache.org/jira/browse/MAHOUT-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-279:
-----------------------------

    Attachment: MAHOUT-279.patch

Here's a different suggestion. The problem is efficiently picking a couple of vectors out of billions. An M/R job seems like overkill for that. This patch just picks random points in the file, syncs, and reads. Unless the underlying implementation is awful, this should be super fast.

The downside is that the choice is slightly biased. We could fix that if needed.

I don't know if this works. Is there a way to test reading on real input?

> Make RandomSeedGenerator a M/R Job
> ----------------------------------
>
>                 Key: MAHOUT-279
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-279
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-279.patch
>
>
> Speedup Random Centroid Selection for clustering using Map/Reduce
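
For illustration only, here is a minimal sketch of the "pick a random point, sync, read" idea from the comment above. It is not the attached patch: it assumes a plain newline-delimited text file in place of the real input format, with the newline standing in for a sync marker, and the function name `sample_records` is hypothetical.

```python
import os
import random


def sample_records(path, k, rng=None):
    """Return k records chosen by seeking to random byte offsets.

    Assumption: one record per line; the newline acts as the sync point.
    The same record may be returned more than once.
    """
    rng = rng or random.Random()
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot sample from an empty file")

    picked = []
    with open(path, "rb") as f:
        while len(picked) < k:
            # Jump to a uniformly random byte offset in the file.
            f.seek(rng.randrange(size))
            # "Sync": discard the rest of the record we landed in.
            f.readline()
            # Read the next complete record.
            line = f.readline()
            if not line:
                # We landed in the last record; wrap around to the first one.
                f.seek(0)
                line = f.readline()
            picked.append(line.rstrip(b"\n").decode("utf-8"))
    return picked
```

Each pick costs one seek and two short reads regardless of file size, which is why no full scan is needed. In this sketch the bias is easy to see: a record is chosen whenever the random offset lands inside the record before it, so records that follow long records are picked more often than records that follow short ones.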