Hi Robin,
Interesting paper. I'm beginning to see how to MR the representative
point selection already. The rest will hopefully become clearer with
more study. Lots of MR jobs are needed to: a) get the data into Vectors,
b) iterate (e.g. kmeans) over the data to produce a set of clusters, c)
cluster the data, d) iterate over the clustered data to derive
representative points for each cluster, and finally e) produce the CDbw.
And, of course, all of this is iterated again with different values for
the clustering algorithm's parameters. Should keep the lights on at PG&E
producing power for the server farms.
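
To make step d) a bit more concrete, here is a rough single-process sketch of how the representative point selection might split into a map and a reduce. It is only an illustration, not Mahout code: the function names, the num_reps parameter and the in-memory run_job driver are all made up for this example, and I am assuming a farthest-first selection (start with the point farthest from the centroid, then repeatedly add the point farthest from the representatives chosen so far), which may differ in detail from what the paper specifies.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(cluster_id, point):
    """Map step (hypothetical): key each clustered point by its cluster id."""
    yield cluster_id, tuple(point)


def reduce_representatives(cluster_id, points, num_reps=3):
    """Reduce step (hypothetical): pick up to num_reps well-scattered points.

    Assumed farthest-first selection: seed with the point farthest from the
    centroid, then add the point whose nearest chosen representative is
    farthest away.
    """
    points = list(points)
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

    # Drop exact duplicates so the same point is never selected twice.
    unique = list(dict.fromkeys(points))

    reps = [max(unique, key=lambda p: dist(p, centroid))]
    while len(reps) < min(num_reps, len(unique)):
        candidates = [p for p in unique if p not in reps]
        reps.append(max(candidates, key=lambda p: min(dist(p, r) for r in reps)))
    return cluster_id, reps


def run_job(clustered_points, num_reps=3):
    """Tiny in-memory stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for cluster_id, point in clustered_points:
        for key, value in map_point(cluster_id, point):
            groups[key].append(value)
    return dict(reduce_representatives(k, v, num_reps) for k, v in groups.items())


if __name__ == "__main__":
    data = [(0, (0, 0)), (0, (1, 0)), (0, (10, 0)), (1, (5, 5))]
    print(run_job(data, num_reps=2))
```

Since each cluster's points all land in a single reduce call, this only works as written if a cluster fits in one reducer's memory. Larger clusters would need a combiner or several passes.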
Robin Anil wrote:
Hi Jeff,
This is a good paper with a simple measure of cluster quality
based on intra-cluster density and inter-cluster separation. It's
pretty easy to compute. We need to make it a map/reduce job:
http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin