[ 
https://issues.apache.org/jira/browse/MAHOUT-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-279:
-----------------------------

    Attachment: MAHOUT-279.patch

Here's a different suggestion. The problem is efficiently picking a couple of 
vectors out of billions. A full M/R job seems like overkill for that. 

This patch just picks random positions in the file, syncs to the next sync 
point, and reads from there. Unless the underlying implementation is awful, 
this should be very fast. The downside is that the choice is slightly biased. 
We could fix that if needed.
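
To make the idea concrete, here is a minimal sketch of the seek-sync-read 
approach. It is not the code in the attached patch. It assumes a plain text 
file with one record per line and uses java.io.RandomAccessFile, with "skip to 
the next newline" standing in for syncing to the next sync point in a Hadoop 
SequenceFile. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomRecordSampler {

  /**
   * Picks k records by seeking to random byte offsets and reading the next
   * complete line. Assumption: one record per line in a plain text file.
   * The same record may be picked more than once.
   */
  static List<String> sample(String path, int k, Random random) throws IOException {
    List<String> picked = new ArrayList<>();
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long length = file.length();
      if (length == 0) {
        return picked; // nothing to sample from
      }
      for (int i = 0; i < k; i++) {
        // Jump to a uniformly random byte offset in [0, length).
        long offset = (long) (random.nextDouble() * length);
        file.seek(offset);
        if (offset > 0) {
          // "Sync": discard the rest of the record we landed in.
          file.readLine();
        }
        String record = file.readLine();
        if (record == null) {
          // We landed in the last record; wrap around to the first one.
          file.seek(0);
          record = file.readLine();
        }
        picked.add(record);
      }
    }
    return picked;
  }
}
```

In this sketch a record is chosen whenever the random offset lands in the 
record before it, so records that follow long records are picked more often. 
Each pick costs one seek and at most a few short reads, regardless of how 
large the file is.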

I don't know whether this works. Is there a way to test reading on real input?

> Make RandomSeedGenerator a M/R Job
> ----------------------------------
>
>                 Key: MAHOUT-279
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-279
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-279.patch
>
>
> Speedup Random Centroid Selection for clustering using Map/Reduce

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
