I wanted to move the discussion to this list because I know that there are formal ways of comparing clusterings, but I do not recall them, and I wanted people with more specific knowledge to be able to respond with more expertise on the formal aspects.
- - - -
More specific questions are easier to answer than general questions:
Are the results simple clusters or whole trees? If whole trees, do you care about the most detailed levels, or are you interested in the levels closest to the root?
Are you looking to classify all cases or most cases?
Are your clusterings based on the same method-index combinations? Are the cases input in the same order? Are there necessarily many ties in the distances? Are they based on the same set of cases but different variables? Different cases but same variables?
How many cases do you have? In what ways do you perceive the clusterings to be different?
What kinds of cases do you have? What constructs do your variables represent? What are the levels of measurement of your variables?
- - - -
One quick and dirty way to compare clusterings is to use SPSS. (Perhaps others can tell you more formal ways, but you would probably still have to assemble the memberships into a single file.)

First, use one of the 3 ways SPSS has to do clustering.

1) CLUSTER has several algorithms for hierarchical models based on a few dozen similarity indices. It allows you to save cluster membership in one of 3 ways. (You can specify a different rootname for each method and index combination.)
   When you want part of the tree, you can give a variable rootname:
      /save=john_ (3,11)
   will save membership for each case in variables called john_3 for the level with 3 groups, john_4 for the level with 4 groups, . . . john_11 for the level with 11 groups.
   When you want to start at the root, simply specify
      /save= mary_ (1,21)
   If you want a particular slice, specify
      /save=bill_ (18)

2) QUICK CLUSTER uses KMEANS to find a specific number of clusters. It is useful for ratio, interval, ordinal, or dichotomous data. For each run with a specified number of clusters you can save the cluster membership of each case and its distance from its cluster center. You can specify different rootnames for the variables these are stored in.

3) TWOSTEP CLUSTER is used when you have a set of categorical variables and a set of continuous variables. For each run with a specified number of clusters you can save the cluster membership of each case and its distance from its cluster center. You can specify different rootnames for the variables these are stored in.
Then do a series of CROSSTABS to see which subgroups are pretty much the same. Remember that the order in which clusters are formed will depend on the method-index combination, so the cluster numbers from one run need not line up with those from another. For this purpose memberships are strictly nominal level. In the policy/social/psychological domains it is customary to try many approaches and accept groups that are recognized by several methods. The more disparate the set of approaches that find the same clusters, the more valid the solution is likely to be, since reliability is considered the upper limit of validity.
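For readers without SPSS, the crosstab step can be sketched in a few lines of Python. This is only an illustration of the idea, not SPSS CROSSTABS itself: the membership lists john_3 and mary_3 are made-up data, and the helper names crosstab and best_matches are hypothetical. Labels are treated as purely nominal, so cluster 1 in one run is free to correspond to cluster 2 in the other.

```python
from collections import Counter

def crosstab(a, b):
    """Count cases for every (label in a, label in b) pair."""
    if len(a) != len(b):
        raise ValueError("membership lists must cover the same cases")
    return Counter(zip(a, b))

def best_matches(a, b):
    """For each cluster in a, find the cluster in b that holds the
    largest share of its cases, and report that share."""
    table = crosstab(a, b)
    sizes = Counter(a)
    result = {}
    for (label_a, label_b), n in table.items():
        share = n / sizes[label_a]
        if label_a not in result or share > result[label_a][1]:
            result[label_a] = (label_b, share)
    return result

# Hypothetical memberships for 8 cases from two method-index combinations.
# The cases must be listed in the same order in both lists.
john_3 = [1, 1, 1, 2, 2, 3, 3, 3]
mary_3 = [2, 2, 2, 3, 3, 1, 1, 3]

table = crosstab(john_3, mary_3)
columns = sorted(set(mary_3))
for label_a in sorted(set(john_3)):
    # One row of the crosstab; a missing pair counts as 0.
    print(label_a, [table[(label_a, label_b)] for label_b in columns])

matches = best_matches(john_3, mary_3)
```

A row with nearly all of its cases in one column marks a subgroup that both runs agree on, whatever number each run happened to give it. Rows spread over several columns are the places where the two clusterings disagree.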
You can also use standalone programs or other packages to produce files that contain case IDs and membership info, and then match/merge them into a single file to do the crosstabs.
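The match/merge step might look roughly like this in Python, assuming each program can export a CSV file with a case ID column and a membership column. The column names (id, cluster, case, group) and the sample contents are assumptions for illustration only; real exports will differ.

```python
import csv
import io

def read_memberships(text, id_field, member_field):
    """Map case id -> cluster label from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[id_field]: row[member_field] for row in reader}

def merge_memberships(first, second):
    """Keep only the cases present in both files, keyed by case id."""
    shared = sorted(first.keys() & second.keys())
    return [(case_id, first[case_id], second[case_id]) for case_id in shared]

# Hypothetical exports from two programs. The rows are in different
# orders and each file has one case the other lacks.
file_a = "id,cluster\nc1,1\nc2,1\nc3,2\nc4,2\n"
file_b = "case,group\nc3,A\nc1,B\nc2,B\nc5,A\n"

merged = merge_memberships(
    read_memberships(file_a, "id", "cluster"),
    read_memberships(file_b, "case", "group"),
)
for row in merged:
    print(row)
```

Matching on the case ID rather than on row position matters here, because the programs need not write the cases in the same order. Cases that appear in only one file are dropped silently in this sketch, so it is worth comparing the merged count with the original counts before doing the crosstabs.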
Hope this helps.
Art
[EMAIL PROTECTED]
Social Research Consultants
University Park, MD USA
(301) 864-5570
Sirotkin, Alexander wrote:
> Hello.
>
> I'm trying to find a way to compare two clusterings (two results of a
> clustering algorithm).
>
> Is there any algorithm (or better yet - working softwar package for
> S-Plus/R/whatever) ?
>
> Thanks a lot...
>
> P.S. Comparing the results visually is not an option - too many records...
>
