It is 0045 Easter Day, and I am a church organist. May not have time
to address all the points you raise. If I leave some out, perhaps
someone else on the list will discuss further...
On Sat, 22 Apr 2000 [EMAIL PROTECTED] wrote:
> The dependent variable is based on the following question.
>
> Recently the problem of overcrowding in the state university system
> in Florida has been the subject of considerable debate. Some have
> suggested the elimination of preferential treatment in college
> admissions as a potential remedy. In your opinion, which of the
> following groups listed should continue receiving preferential
> treatment in the admissions process? (1 = yes, 0 = no)
>
> q45a athletes (1, 0)
> q45b national merit scholars (honor students) (2, 0)
> q45c economically disadvantaged (3, 0)
> q45d historically disadvantaged (4, 0)
> q45e children of wealthy benefactors (5, 0)
> q45f children of university alumni (1, 0)
> q45g ethnic and racial minorities (2, 0)
> q45h students with disabilities (3, 0)
> q45i students with prior criminal records (4, 0)
> q45j students with unique artistic talents (5, 0)
>
> The goal is to determine whether or not a person's choices on the
> 10 questions are predictable in terms of independent variables such
> as education, age, race, income, marital status, political party,
> sex, racial attitudes, etc.
Ah. Now I see what you meant by "multivalued". It hadn't registered
earlier. What I would do is first search to see what patterns actually
occur. Saves defining several hundred 2x2 tables that don't exist in the
data, for example. Starting with your variables q45a-q45e, I'd recode as
shown above. (This recoding is not _logically_ necessary, but it sure
makes it easier for humans to identify patterns.) I would then either
re-read the data as a 5-digit number, if that were convenient in the
available software, or construct a 5-digit number from values extant in
the file, thus (using your extant (1,0) codes):
let q45.1 = q45a*10000 + q45b*2000 + q45c*300 + q45d*40 + q45e*5
There are 32 possible values of q45.1:
0 300 2000 2300 10000 10300 12000 12300
5 305 2005 2305 10005 10305 12005 12305
40 340 2040 2340 10040 10340 12040 12340
45 345 2045 2345 10045 10345 12045 12345
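(If it helps, the same construction is a one-liner in a general-purpose
language as well; a minimal sketch in Python, assuming each respondent's
five answers are available as 0/1 values named a, b, c, d, e -- my
notation for illustration, not yours:
    def q45_code(a, b, c, d, e):
        # Combine five 0/1 responses into one 5-digit pattern code.
        return a*10000 + b*2000 + c*300 + d*40 + e*5
    q45_code(1, 0, 1, 1, 0)   # -> 10340, i.e. items a, c, d checked
The equivalent transformation statement exists in most packages.)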
(Equivalently, you could define this variable in binary rather than
decimal code:
let q45.1b = q45a*16 + q45b*8 + q45c*4 + q45d*2 + q45e
which produces values from 0 to 31 by consecutive integers; and you may
prefer this.) Me, I find the 5-digit codes informative -- I can see at a
glance which items have been checked by the respondent. As a mild
refinement on presentation, the output of a frequency-counting routine
can be edited to substitute "." for every "0", and I find these patterns
even clearer to read & interpret:
. 3.. 2... 23.. 1.... 1.3.. 12... 123..
5 3.5 2..5 23.5 1...5 1.3.5 12..5 123.5
4. 34. 2.4. 234. 1..4. 1.34. 12.4. 1234.
45 345 2.45 2345 1..45 1.345 12.45 12345
Of course, in a system like SPSS you could code the responses in 5-digit
binary and assign value labels as above.
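(Purely as an illustration, not any particular package's syntax, the
"." substitution is also a one-liner in Python, under the same
assumptions as the sketch above:
    def dotted(code):
        # Right-justify the pattern code in 5 columns, then show "."
        # in place of every "0".
        return format(code, "5d").replace("0", ".")
    dotted(2300)    # -> " 23.."
    dotted(10340)   # -> "1.34."
Run over the value column of a frequency table, that reproduces the
display just shown.)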
Now do the same for q45f-q45j, to produce q45.2. Cross-classify
these two new variables. (Maximum size of table: 32x32 = 1024 cells.
Actual size: less, because some cells, and probably also some rows and
columns, will be empty. This will of course necessarily be the case if
N < 1000, but even if N = 50,000 it would be surprising if ALL possible
combinations were chosen.)
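(A sketch of that cross-classification in Python, with made-up pattern
codes for a handful of respondents -- counting only the combinations
that actually occur, so the empty cells never need to be defined:
    from collections import Counter
    codes1 = [12345, 300, 0, 300, 10340]   # hypothetical q45.1 values
    codes2 = [0, 2005, 0, 2005, 45]        # hypothetical q45.2 values
    table = Counter(zip(codes1, codes2))
    for (c1, c2), n in sorted(table.items()):
        print(c1, c2, n)
This carries all the information the full 32 x 32 layout would carry.)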
For the question described, it is not clear to me that any
inferential statistics is necessary. What more does the client need than
to be able to say things like
"Of 3219 respondents, at most 53% favored eliminating
preferential treatment for any one of the ten categories; by category,
53% favored e.p.t. for students with prior criminal records,
48% " " " students with disabilities,
. . .
5% " " " children of university alumni" ?
(I'd list them in descending order of %). In a technical appendix, one
might want to add the 95% margins of error for these various %s, but I
cannot see a need for hypothesis testing. (What _substantive_ hypotheses
would be interesting to test? Answers to this question are NOT in the
form "one dichotomy is independent of another dichotomy", but in language
that makes sense to the client.)
These %s would include those who chose this category only AND those who
chose this category in combination with one or more others. Combinations
that were favored by an interestingly large % should also be mentioned,
of course, but in another paragraph entirely.
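(For that technical appendix, the margin of error is just the usual
normal approximation for a proportion; a Python sketch, using the
made-up figures from the illustration above:
    import math
    def margin_of_error(p_hat, n, z=1.96):
        # 95% margin of error for a sample proportion (normal approx.).
        return z * math.sqrt(p_hat * (1 - p_hat) / n)
    print(round(100 * margin_of_error(0.53, 3219), 1))   # -> 1.7 points
Nothing fancier is needed unless some of the %s are very close to 0 or
100.)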
I had grumbled:
> > ... So far as I can tell from all that algebra, you're effectively
> > substituting a whole bunch of 2x2 tables for a single RxC table
> > (R = number of rows, C = number of columns) with R>2 and(/or?) C>2.
> > Or, for each of several RxC tables.
>
> Yes. This is the intent. I wanted to reduce the R by C table to a
> series of 2 by 2 tables.
Yes, well, I still don't see why this is advantageous. Or perceived as
advantageous.
> > Why do you not first do the obvious contingency table chi-square
> > to see if there's anything worth following up? (And if I were doing
> > it, the follow-up(s) would be in the RxC format as well.)
>
> It is because the dependent variable is multivalued. A person may
> check 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 of the preferences. I want
> to be able to identify which of the choices was checked, not just the
> number of them. There is no theory for weighting the choices before
> the analysis is complete. I wanted to construct a method for
> systematically examining all 1023 possible choices of dependent
> variable.
This, however, doesn't require 1023 hypothesis tests. See above.
< snip, discussion of error in chi square formula >
> Do you mean that I've stated correctly that the null hypothesis for
> the chi square test on crosstab is that the variables are
> independent?
Yup. Why do you sound so surprised?
< snip >
> > Here it begins to get sticky. I cannot tell whether you mean the
> > same thing by "interaction" that I would mean. In particular,
> > there seems to be no difference between "interaction variable",
> > in your terms, and "indicator variable", in my terms.
>
> I'm not sure what you mean by "indicator variable". I've only seen
> the term in connection with latent variables in structural equation
> modeling.
Sometimes called ("mis-called", Joe Ward would say) "dummy variables",
value = 1 (if a member of the indicated set) or 0 (if not). More
precisely, I should have written "product of indicator variables" in
the last line of my paragraph just above.
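(Concretely -- a Python illustration only, with invented categories:
    # Indicator ("dummy") variables: 1 if the case belongs to the
    # indicated set, 0 if not.
    respondent = {"sex": "F", "party": "D"}
    female = 1 if respondent["sex"] == "F" else 0
    democrat = 1 if respondent["party"] == "D" else 0
    # "Product of indicator variables": indicates membership in the
    # intersection of the two sets.
    female_democrat = female * democrat
That product is what gets entered into a linear model when one wants to
examine the interaction, as discussed below.)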
> I'm pretty sure my use of "interaction" is consistent with your use
> of "interaction".
Well, it is customary to _model_ interaction in linear models by
multiplying together the variables whose interaction is to be examined.
But strictly speaking, the "interaction variable" is the [logical?] part
of this product that is not correlated with, or "explained by", those
variables (or by their lower-order interactions). But as remarked above,
I do not see what mileage you can get out of testing hypotheses, and the
interesting information, in my view, is the proportions (or %s) of the
sample who chose various combinations... AND, quite probably, how those
proportions (or %s) change according to characteristics of the
respondents (which I assume to be another part of the enterprise:
certainly MY university would be interested in knowing whether alumni,
faculty, students, potential benefactors, ... expressed similar opinions;
wouldn't EXPECT them to, but to the extent that they did it would
simplify some of the decision-making).
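(For completeness, the "strictly speaking" construction above -- the
part of the product not explained by its components -- can be obtained
by regressing the product on the component variables and keeping the
residual. A Python sketch with numpy, on made-up 0/1 data:
    import numpy as np
    rng = np.random.default_rng(1)
    x1 = rng.integers(0, 2, 100).astype(float)   # two invented indicators
    x2 = rng.integers(0, 2, 100).astype(float)
    product = x1 * x2
    # Regress the product on a constant, x1 and x2; the residual is the
    # part of the product not "explained by" those variables.
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta, *_ = np.linalg.lstsq(X, product, rcond=None)
    interaction = product - X @ beta
But, as I say, I'd spend the effort on the descriptive proportions
first.)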
< snip, details of "interactions" ... >
> > > Suppose R is an ordinal level variable with values r1 < r2 < r3 < r4.
> > > Then R is converted to three variables S1,S2,S3 with
> > >
> > > S1 = 1 if R = r1 and S1 = 0 otherwise.
> > > S2 = 1 if R = r1 or r2 and S2 = 0 otherwise.
> > > S3 = 1 if R = r1 or r2 or r3 and S3 = 0 otherwise.
> >
> > Possible, but doesn't seem necessary. Why not leave R as it is instead of
> > constructing dichotomies?
>
> Because I'm focused on reducing the analysis to comparing the
> results of a series of 2 by 2 tables.
Yes, I'd rather thought so. Do you mind my saying that it looks from
here almost like a monomania? ;-)
> The goal will be to have the computer analyse the 2 by 2 tables
> and give summary results to the researcher.
The way you phrase this puts me in mind of a saying of a Canadian
colleague of mine, with respect to some computer-generated statistical
analyses:
"untouched by the human mind".
< snip >
> > (You defined a "Col pct" as t/D, which is not a % but a proportion.)
>
> Right. As a mathematician I completely ignore the distinction
> between % and proportion.
Hmph. I _hope_ you don't ignore the question of where one places
the decimal point.
< snip, details of chi-square formula >
> > > The significance number that is calculated for a statistic is the
> > > predicted probability that the null hypothesis is true.
>
> > I very much doubt it. This certainly does not conform to the standard
> > statistical definition of "level of significance", which I assume to be
> > what you want us to understand by "significance number".
>
> I would like to discuss this point more. Why does it not conform to
> the standard statistical definition of level of significance?
Read on ...
> We use the p-value calculated for a statistic by comparing it to a
> preset alpha. Alpha is the probability of a type one error.
Some there are who argue against using fixed, preset alpha.
But you correctly describe the process.
> A type one error is rejecting the null hypothesis when it is true.
O.K. so far.
> It seemed intuitive to me that we had to say p was the predicted
> probability of the null hypothesis being true,
We don't _have_ to say anything at all; but what we do say
should be either true, or logically arguable. This is neither.
> and that we would reject the null
> hypothesis if this predicted probability fell below alpha.
This is correct: if for "this predicted probability", meaning
"Pr{null hypothesis is true}" you substitute
"observed Pr{Type I error}" or, equivalently,
"Pr{departure this large or larger from H_null | H_null true}".
> Does this make sense to you, or am I falling into some standard
> trap of beginning statistics students?
The latter, I'm afraid.
The purpose in invoking a probability at all is to be able to put some
kind of bound on the probability that the decision one reports is wrong.
Oversimplifying, as usual, there are two ways one can be wrong: in
rejecting a null hypothesis that is true, and in accepting a null
hypothesis that is not true. Sometimes this is represented schematically
thus:
                         True state of the universe ("known but to God"):
  Investigator's         Hypothesis true         Hypothesis false
  decision:
    Reject hypothesis    Error I                 Correct decision
    Accept hypothesis    Correct decision        Error II
Some folks insist on "Fail to reject" instead of "Accept". This is
proper when (as is often the case) one cannot specify a probability
distribution associated with "Hypothesis false". One function of the
specification of a null hypothesis is to be able to specify a sampling
distribution for the statistic being observed in the state "Hypothesis
true". This then makes it possible to describe (estimate) the
probability that one's data are consistent with the null hypothesis, and
to reject the null if that probability is persuasively low (viz., less
than alpha). But this probability is NOT "the predicted probability
that the null hypothesis is true" -- it is the CONDITIONAL probability
of observing data like this, or more distant from the value specified in
the null hypothesis than this, IF the null hypothesis be true.
No conditional probability can describe the probability of its own
condition.
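(If a demonstration would help: generate many 2x2 tables for which the
null hypothesis of independence is true by construction, and the
chi-square p-value falls below alpha in about alpha of them -- which is
what "Pr{Type I error} = alpha" means. A Python sketch, assuming numpy
and scipy are available:
    import numpy as np
    from scipy.stats import chi2_contingency
    rng = np.random.default_rng(0)
    n_subjects, n_tables, alpha = 200, 2000, 0.05
    rejections = 0
    for _ in range(n_tables):
        # Two dichotomies generated independently, so H_null is true.
        x = rng.integers(0, 2, n_subjects)
        y = rng.integers(0, 2, n_subjects)
        table = [[np.sum((x == i) & (y == j)) for j in (0, 1)]
                 for i in (0, 1)]
        p = chi2_contingency(table, correction=False)[1]
        rejections += (p < alpha)
    print(rejections / n_tables)   # close to alpha, i.e. about 0.05
Nothing in that computation refers to, or could produce, the probability
that the null hypothesis is true; that would require a prior probability
for the hypothesis, which the significance test neither uses nor
supplies.)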
< snip, the rest, most of which is based on the logical
error(s?) addressed above >
It's 2:15 and past my bedtime.
-- DFB.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128