>>>>> "Dylan" == Dylan Beaudette <[EMAIL PROTECTED]> >>>>> on Mon, 22 May 2006 17:33:47 -0700 writes:
Dylan> Greetings,

Dylan> Experimenting with the cluster package, and am starting
Dylan> to scratch my head in regards to the *best* way to
Dylan> standardize my data. Both functions can pre-standardize
Dylan> columns in a dataframe. According to the manual:

Dylan>   Measurements are standardized for each variable
Dylan>   (column), by subtracting the variable's mean value
Dylan>   and dividing by the variable's mean absolute
Dylan>   deviation.

Dylan> This works well when input variables are all in the
Dylan> same units. When I include new variables with a
Dylan> different intrinsic range, the ones with the largest
Dylan> relative values tend to be _weighted_. This is
Dylan> certainly not surprising, but complicates things.

Dylan> Does there exist a robust technique to effectively
Dylan> re-scale each of the variables, regardless of their
Dylan> intrinsic range, to some set range, say from {0,1}?

Dylan> I have tried dividing a variable by the maximum value
Dylan> of that variable, but I am not sure if this is
Dylan> statistically correct.

A more usual standardization is accomplished by the function
-- guess what? -- scale().
By default it standardizes each column to mean 0 and standard
deviation 1, but you can use it as well to do a [0,1] scaling.

Note that you are very wise to think about the importance of
variable scaling / weighting for cluster analysis.
But people have been "here" before, and invented the much more
general notion of a distance/dissimilarity between observational
units.

--> The functions daisy() {in "cluster"} and dist() {from "stats"}
provide such dissimilarity objects.

These can be used as input for pam() or clara() as well, and in
constructing them you are much more flexible than when trying to
find a proper scaling of your x-matrix.

Note that daisy() in particular has been designed for computing
sensible dissimilarities for the case when the X-matrix has a
collection of continuous {e.g. "interval scaled"} and of
categorical (e.g. binary) variables.
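A minimal sketch of both suggestions in R, assuming a small made-up data frame `x`. The column names `depth`, `clay` and `wet`, the simulated values, and the choice `k = 2` are hypothetical and not from the original question. It shows scale() with explicit `center` and `scale` arguments for a [0,1] rescaling, and daisy() building a dissimilarity that pam() takes directly.

```r
library(cluster)  # provides daisy() and pam()

# Hypothetical data: two continuous variables on very different
# scales plus one binary factor (made up for illustration).
set.seed(1)
x <- data.frame(
  depth = c(rnorm(10, 20, 5), rnorm(10, 80, 5)),
  clay  = c(rnorm(10, 0.15, 0.03), rnorm(10, 0.40, 0.03)),
  wet   = factor(rep(c("no", "yes"), each = 10))
)

# Numeric columns only, as a matrix, for scale().
num <- sapply(x, is.numeric)
xn  <- as.matrix(x[, num])

# (a) Default scale(): each column gets mean 0 and std. deviation 1.
x_std <- scale(xn)

# (b) [0,1] scaling with scale(): subtract the column minimum and
#     divide by the column range (max - min).
mins <- apply(xn, 2, min)
rngs <- apply(xn, 2, max) - mins
x_01 <- scale(xn, center = mins, scale = rngs)

# (c) Skip manual scaling: let daisy() build the dissimilarities
#     from all columns, including the factor, and pass them to pam().
d   <- daisy(x, metric = "gower")
fit <- pam(d, k = 2)  # k = 2 is an arbitrary choice for this toy data
table(fit$clustering)
```

In route (c), Gower's coefficient divides each numeric variable by its range, so no separate rescaling of `x` should be needed there. Route (b) fails for a constant column, because its range is zero.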
I recommend you get a textbook on clustering, to read up more on
the subject.

Regards,
Martin Maechler, ETH Zurich

Dylan> Any ideas, thoughts would be greatly appreciated.

Dylan> Cheers,

Dylan> --
Dylan> Dylan Beaudette
Dylan> Soils and Biogeochemistry Graduate Group
Dylan> University of California at Davis
Dylan> 530.754.7341

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html