The cluster center itself is a representative point. One pass over the data will get us the points that are close enough to it. Or, to be exhaustive, maybe we can just add that to the KMeans Mapper and update a counter?
Robin

On Fri, Apr 9, 2010 at 4:13 AM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> Looking at the paper it doesn't seem to require MR for the final CDbw
> calculation, right? For each cluster we only need to compare one of its
> points with one point in each other cluster. With small numbers of
> representative points per cluster that can be done easily in memory. I'd
> love to see the code you have for computing representative points.
>
> Jeff
>
> Robin Anil wrote:
>
>> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
>>
>>> Hi Robin,
>>>
>>> Interesting paper. I'm beginning to see how to MR the representative
>>> point selection already. The rest will hopefully become clearer with
>>> more study. Lots of MR jobs are needed to:
>>>
>>> a) get the data into Vectors,
>>
>> We have something for text, missing for other formats
>>
>>> b) iterate (e.g. kmeans) over the data to produce a set of clusters,
>>
>> Done
>>
>>> c) cluster the data,
>>
>> Done
>>
>>> d) iterate over the clustered data to derive representative points for
>>> each cluster, and finally
>>
>> Done ;)
>>
>>> e) produce the CDbw.
>>
>> TODO
>>
>>> And, of course all of this is again iterated with different values for
>>> the clustering algorithm's parameters. Should keep the lights on at PG&E
>>> producing power for the server farms.
>>>
>>> Robin Anil wrote:
>>>
>>>> Hi Jeff,
>>>> This is a good paper with a simple measure of cluster quality
>>>> based on intra cluster density and inter cluster separation. It's
>>>> pretty easy to compute. Need to make it a map/reduce job
>>>>
>>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>>> Robin
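
For anyone following the thread, here is a minimal sketch of one way to read the "one pass over the data" idea at the top: treat each cluster center as the first representative point, then make a single pass over the clustered points and keep the k points nearest to their center. This is only an illustration, not the Mahout code discussed above, and not necessarily the selection rule used in the paper. The function name `representative_points`, the parameter `k`, the Euclidean distance, and the `(cluster_id, point)` input format are all assumptions.

```python
import heapq
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_points(assignments, centers, k=3):
    """Single pass over (cluster_id, point) pairs.

    Returns {cluster_id: [center, nearest point, next nearest, ...]},
    keeping at most k points per cluster in addition to the center.
    """
    # One bounded heap per cluster. Distances are negated so that the
    # root of the min-heap is the farthest point currently kept.
    heaps = defaultdict(list)
    for seq, (cid, point) in enumerate(assignments):
        d = euclidean(point, centers[cid])
        heap = heaps[cid]
        if len(heap) < k:
            heapq.heappush(heap, (-d, seq, point))
        elif -d > heap[0][0]:
            # Closer than the farthest kept point: swap it in.
            heapq.heapreplace(heap, (-d, seq, point))

    result = {}
    for cid, center in centers.items():
        # Order kept points by increasing distance, ties by arrival order.
        kept = sorted(heaps[cid], key=lambda e: (-e[0], e[1]))
        result[cid] = [center] + [point for _, _, point in kept]
    return result


if __name__ == "__main__":
    centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
    clustered = [
        (0, (1.0, 0.0)),
        (0, (5.0, 5.0)),
        (0, (0.0, 0.5)),
        (1, (10.0, 11.0)),
        (1, (20.0, 20.0)),
    ]
    for cid, reps in representative_points(clustered, centers, k=2).items():
        print(cid, reps)
```

Because each cluster only ever holds k points plus its center, the per-cluster state stays small, which fits Jeff's remark that a small number of representative points per cluster can be handled in memory. In a map/reduce setting the same bounded-heap logic could presumably run per cluster key, but that mapping is not shown here.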