Re: [R] Cluster analysis on weighted survey data with continuous and categorical variables

Thomas Lumley Tue, 19 Mar 2013 12:41:50 -0700

On Wed, Mar 20, 2013 at 3:55 AM, Emma Gibson <waterbab...@hotmail.com> wrote:


> I am trying to perform cluster analysis on survey data where each
> respondent has answered several questions, some of which have categorical
> answers ("blue", "pink", "green", etc.) and some of which have scale answers
> (rating from 1 to 10, etc.). My problem is that certain age groups were
> over-sampled and I need to weight the data collected in order to accurately
> reflect the current population. Will it make a difference if I do the
> cluster analysis on the weighted data, and if so, how do I do cluster
> analysis on the weighted data? Any advice would be much appreciated!
> Thanks,
> Emma
>


The unequal sampling will have some effect on most clustering methods (e.g.
k-means or average-linkage, though not single-linkage).  Whether this matters
depends on whether you have genuinely separate clusters in the population
or a general mush that you are trying to segment in some convenient way.

If you have genuine well-separated clusters, then ignoring the oversampling
is likely to work well.  If you don't, you will get a segmentation into
clusters that partitions the over-sampled people too finely and the
under-sampled people too coarsely.
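
To see why k-means is affected: a k-means centroid is the mean of the
points assigned to it, so over-sampled respondents pull it toward
themselves. A minimal sketch in Python (standard library only), where the
ratings, the weights, and the helper name `centroid` are all made up for
illustration:

```python
def centroid(values, weights=None):
    """Weighted mean of 1-D values; equal weights if none are given."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Made-up data: three over-sampled respondents rating 2,
# one under-sampled respondent rating 8.
ratings = [2.0, 2.0, 2.0, 8.0]
# Hypothetical sampling weights, inversely proportional to the
# probability of being sampled.
weights = [1.0, 1.0, 1.0, 3.0]

unweighted = centroid(ratings)           # 14 / 4 = 3.5
weighted = centroid(ratings, weights)    # 30 / 6 = 5.0
```

The unweighted centre sits closer to the over-sampled group than the
weighted one does, which is the distortion described above.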

I don't know of any R functions that cluster with sampling weights.

If your data set is fairly small, you could expand it by making duplicates
(perhaps jittered) of some points, and cluster the expanded data set.  On
the other hand, if it is very large, you can thin it out to a uniform
sample by sampling from it with probability inversely proportional to the
original sampling probability.
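
Both ideas can be sketched roughly as follows. This is an illustration in
Python (standard library only) rather than R, and the helper names
`expand_by_weight` and `thin_to_uniform` are hypothetical. It assumes the
weights have been rescaled so the smallest is about 1, and that only
fields stored as floats should be jittered:

```python
import random

def expand_by_weight(rows, weights, jitter=0.0, seed=0):
    """Duplicate each row about `weight` times (at least once).

    Float fields get Gaussian noise with standard deviation `jitter`;
    categorical (non-float) fields are copied unchanged.
    Note: round() rounds halves to the nearest even integer.
    """
    rng = random.Random(seed)
    expanded = []
    for row, w in zip(rows, weights):
        for _ in range(max(1, round(w))):
            expanded.append([
                x + rng.gauss(0.0, jitter) if isinstance(x, float) else x
                for x in row
            ])
    return expanded

def thin_to_uniform(rows, sampling_probs, seed=0):
    """Keep row i with probability min(p) / p_i.

    Rows with the smallest sampling probability are always kept;
    over-sampled rows are dropped at random in proportion.
    """
    rng = random.Random(seed)
    p_min = min(sampling_probs)
    return [row for row, p in zip(rows, sampling_probs)
            if rng.random() < p_min / p]

# Made-up respondents: (favourite colour, rating).
rows = [["blue", 7.0], ["pink", 3.0]]
expanded = expand_by_weight(rows, weights=[1, 3])            # 4 rows, no jitter
thinned = thin_to_uniform(rows, sampling_probs=[0.3, 0.1])   # "pink" row always kept
```

Either result can then be passed to an ordinary, unweighted clustering
routine.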

   - thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland
