Dear Weiwei,

your question sounds a bit too general and complicated for the R-list. Perhaps you should look for personal statistical advice. The quality of methods (and especially the choice of distance) for down-sampling certainly depends on the structure of the data set. At the moment I do not see why you need any down-sampling at all, and you should first find out whether and why it is a good thing to do (by whatever method).

An obvious candidate for a clustering algorithm would be pam/clara in package cluster, because this approach chooses points already in the data set as cluster centres (medoids) and therefore produces a proper subsample, which is not the case for most other clustering methods.

However, in

C. Hennig and L. J. Latecki: The choice of vantage objects for image retrieval. Pattern Recognition 36 (2003), 2187-2196.

the clustering approach was clearly outperformed by some stepwise selection approaches for down-sampling. Admittedly this was a different kind of problem, but I think that the reasons for this may also apply to your situation.

You can compare different clusterings (or choices of a subset) by cross-validation or bootstrap applied to the resulting decision tree in the classification problem.

Best,
Christian

On Mon, 25 Jul 2005, Weiwei Shi wrote:

> Dear listers:
>
> Here I have a question on clustering methods available in R. I am
> trying to down-sampling the majority class in a classification problem
> on an imbalanced dataset. Since I don't want to lose information in
> the original dataset, I don't want to use naive down-sampling: I think
> using clustering on the majority class' side to select
> "representative" samples might help. So, my question is, which
> clustering method should be tested to get the best result. I think the
> key thing might be the selection of "distance" considering the next
> step in which I would like to use decision trees.
>
> Please share your experience in using clustering (Any available
> implementation outside R is also welcome)
>
> weiwei
> --
> Weiwei Shi, Ph.D
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>

*** NEW ADDRESS! ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche
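
To make the medoid idea behind the pam/clara suggestion above a bit more concrete, here is a minimal sketch of how medoids give a proper subsample of the majority class. It is written in plain Python only to stay self-contained; in R one would simply call pam or clara from package cluster. The function names (euclidean, total_cost, pam_like_medoids), the toy data and the use of Euclidean distance are assumptions made for illustration, and the search is a naive greedy build followed by first-improvement swaps in the spirit of PAM, not the algorithm as implemented in package cluster.

```python
import math


def euclidean(a, b):
    """Placeholder dissimilarity; the right choice depends on the data."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_cost(d, medoids):
    """Sum over all points of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in d)


def pam_like_medoids(points, k, dist=euclidean):
    """Return sorted indices of k medoids (hypothetical helper, naive search)."""
    n = len(points)
    # Full dissimilarity matrix: only sensible for small n.
    d = [[dist(p, q) for q in points] for p in points]

    # Build phase: greedily add the point that lowers the total cost most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(d, medoids + [c]))
        medoids.append(best)

    # Swap phase: accept any medoid/non-medoid swap that lowers the cost,
    # and stop once a full pass finds no improving swap.
    improved = True
    while improved:
        improved = False
        current = total_cost(d, medoids)
        for pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(d, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    return sorted(medoids)


# Toy majority class with two well-separated groups (made-up numbers).
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

medoid_idx = pam_like_medoids(majority, k=2)
# The down-sampled majority class consists of original observations only.
downsampled = [majority[i] for i in medoid_idx]
print(medoid_idx, downsampled)
```

Because the retained rows are actual observations, they can be passed straight to the tree-fitting step together with the minority class. Whether such a subsample beats naive down-sampling (or no down-sampling at all) is exactly what the cross-validation or bootstrap comparison mentioned above would have to show.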