Hi, everybody
David said my R code text attachment got rejected by the mailing list. Pooh. I don't think that's nice. I don't see anything in the posting guide about a limit on text attachments. Well, if you are still trying to understand what 'orthogonal polynomial' means, I suggest you run the following through. I thought it was an enlightening experience.

# Paul Johnson <paulj...@ku.edu> Nov. 16, 2005
# Ordinal predictors with a small number of possible values.
# Here is R code and commentary about ordinal predictors.
# Please let me know if you have insights to explain this more clearly.

set.seed(199999)
sampleSize <- 100
subgroupSize <- sampleSize/5

# One of those "5 point opinion" questions, Strongly Disagree ...
# Strongly Agree, with values assigned 1, 3, 5, 7, 9
surveyQuestion <- c(rep(1, subgroupSize), rep(3, subgroupSize),
                    rep(5, subgroupSize), rep(7, subgroupSize),
                    rep(9, subgroupSize))

### Make this an unordered factor
myufac <- factor(surveyQuestion)
### Study the contrasts:
contrasts(myufac)

### Make an ordered factor
myofac <- ordered(surveyQuestion)
contrasts(myofac)

# another independent variable
x <- rnorm(sampleSize)
# a random error
e <- rnorm(sampleSize)

# First, suppose the outcome really is just a linear reflection of
# surveyQuestion: it is created by multiplying against the evenly
# spaced values observed in the survey.
y1 <- -5 + 4*surveyQuestion + 6*x + 10*e

mod0 <- lm(y1 ~ x + surveyQuestion)
summary(mod0)

# Question: are you making a mistake by just "throwing"
# surveyQuestion into the analysis? You should not be making a
# mistake, because you (luckily) guessed the right model.

# You might check by running the model with the unordered factor,
# which amounts to running "dummy variables":
mod1 <- lm(y1 ~ x + myufac)
summary(mod1)

# or with the ordered factor, which estimates the orthogonal
# polynomials:
mod2 <- lm(y1 ~ x + myofac)
summary(mod2)

# Does either of those models appear to be "better" than the
# original mod0?

# If I did not know for sure the factor was ordered, I'd be stuck
# with the treatment contrasts in mod1. And what I would really like
# to know is this: are the coefficients in mod1 "stepping up" in a
# clear, ordinal, evenly spaced pattern? Since we know the data were
# created with a coefficient of 4, we would expect the coefficients
# to "step up" like this:
#   myufac3    8
#   myufac5   16
#   myufac7   24
#   myufac9   32
# See why we expect this pattern? The intercept "gobbles up" myufac1,
# so each "impact" of surveyQuestion has to be reduced by 4 units.
# The impact of myufac3, which you might expect to be 3*4 = 12, is
# reduced to 3*4 - 4 = 8, and so forth.

# But that does not happen with a smallish sample. You can run this
# code a few times. It appears to me that a sample of more than
# 10,000 is necessary to get any faithful reproduction of the "steps"
# between values of surveyQuestion.

# Question: would we mistakenly take the unevenness observed in
# summary(mod1) as evidence of the need to treat surveyQuestion as a
# nominal factor, even though we know (since we created the data) that
# we might as well just throw surveyQuestion in as a single variable?

# How to decide? We need a hypothesis test of the conjecture that the
# coefficient estimates in mod1 fall "along a line", i.e. that the
# coefficient on myufac-i equals b2*i - b2 for a single slope b2.
# I believe that the correct test results from this command:
anova(mod0, mod1, test = "Chisq")

# If that is significant, it means you are losing predictive power by
# throwing in surveyQuestion as if it were a numerical variable.
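# To actually see that staircase, here is a quick sketch that re-runs
# the dummy-variable fit with a much bigger sample (this was not in
# the original script; bigN and the big* names are just illustrative):
bigN <- 20000
bigQ <- rep(c(1, 3, 5, 7, 9), each = bigN/5)
bigX <- rnorm(bigN)
bigY <- -5 + 4*bigQ + 6*bigX + 10*rnorm(bigN)
# with 4000 per group, the factor coefficients should now land
# close to the 8, 16, 24, 32 pattern described above
coef(lm(bigY ~ bigX + factor(bigQ)))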
# Now, what if we are sure that the data gathered in surveyQuestion
# really are ordinal, and so we estimate mod2. What do those
# parameters mean? Here I'm just reasoning/guessing, because I cannot
# find any complete idiot's guide to orthogonal polynomials and their
# use in regression and/or R.

# Take a look at the contrasts themselves:
# > ord1 <- contrasts(myofac)
# > ord1
#              .L         .Q            .C         ^4
# 1 -6.324555e-01  0.5345225 -3.162278e-01  0.1195229
# 3 -3.162278e-01 -0.2672612  6.324555e-01 -0.4780914
# 5 -3.287978e-17 -0.5345225  1.595204e-16  0.7171372
# 7  3.162278e-01 -0.2672612 -6.324555e-01 -0.4780914
# 9  6.324555e-01  0.5345225  3.162278e-01  0.1195229

# What does this mean? I believe: we act "as though" there are 4
# independent variables in the regression, L, Q, C, and ^4, and for
# each respondent in the dataset we choose a row of these numbers to
# act as independent variables. A person who answered 1 on the survey
# would have the input values (-0.63, 0.53, -0.32, 0.12) for those
# four variables.

# This raises the question: what are L, Q, C, and ^4?

### This is tough to explain.

# Background: recall from calculus that any function can be
# approximated by a polynomial. Since surveyQuestion has only 5
# possible values, a polynomial of degree 4 is needed to "replace it"
# in a regression model. It can replace it EXACTLY, not just
# approximately. So we might fit
#   y1 = a + b*x + c*surveyQuestion
#              + d*surveyQuestion^2
#              + e*surveyQuestion^3
#              + f*surveyQuestion^4
# That would give a "direct test" of whether you need to worry about
# the coding of surveyQuestion. If d, e, and f are zero, then
# including surveyQuestion as a single linear term is the right way
# to go. If you get significant estimates on ^2, ^3, and/or ^4, then
# you would know it was a mistake to throw in surveyQuestion by
# itself.

# Here is the big problem: there would be *severe multicollinearity*
# in those estimates. The standard errors would be huge, and the
# variance of the estimated coefficients would be massive. That
# happens because the ^2, ^3, and ^4 variables are nearly
# proportional to each other in many datasets.

# But there is a way to re-scale those terms so they are not
# multicollinear. Instead of estimating the polynomial directly, we
# use the "rescaled" orthogonal polynomial values. Note that there is
# no correlation between the 4 columns of the orthogonal contrasts.
# They are "orthogonal polynomials", so there is absolutely no
# multicollinearity problem among them. Observe:
# > ord1 <- contrasts(myofac)
# > t(ord1[,2]) %*% ord1[,3]
#               [,1]
# [1,] -3.579222e-17
# That's very, very close to 0.0.

# We can do all of those multiplications at once; check the diagonal
# here:
# > t(ord1) %*% ord1
#               .L            .Q            .C            ^4
# .L  1.000000e+00  4.710858e-17 -7.123208e-17  2.143332e-17
# .Q  4.710858e-17  1.000000e+00 -3.579222e-17 -3.916680e-17
# .C -7.123208e-17 -3.579222e-17  1.000000e+00  3.582678e-16
# ^4  2.143332e-17 -3.916680e-17  3.582678e-16  1.000000e+00

# That is as much ill-informed guessing as I can do right now about
# orthogonal polynomials.

# Anyway, if you run the regression mod2 and the higher-order terms
# are not significant, then it is a hint that your coding is OK.
# That is, just "throw in" surveyQuestion. And I believe a rigorous
# hypothesis test is obtained by
anova(mod0, mod2, test = "Chisq")

# What's the good news? The hypothesis test result is EXACTLY THE
# SAME whether we test the ordinal or the nominal coding. Whew!
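# A quick sketch to check the two claims above (not in the original
# script; it assumes the objects built so far). The ordered-factor
# contrasts are just contr.poly(), and the raw degree-4 polynomial
# fit is the same model as mod2, only in a different basis, so the
# fitted values agree.
ord1 <- contr.poly(5)        # the matrix R uses for a 5-level ordered factor
round(crossprod(ord1), 10)   # t(ord1) %*% ord1 with the 1e-17 noise rounded away
modRaw <- lm(y1 ~ x + poly(surveyQuestion, 4, raw = TRUE))
all.equal(fitted(modRaw), fitted(mod2))   # TRUE: same fit, up to rounding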
# Not such a big issue, then, whether we use factor or ordered when
# deciding if we can just "throw" surveyQuestion into the model.

# Now change the problem so that the "real data" are not created by
# multiplying against the pure, unadulterated surveyQuestion. Now the
# ordinal impact of the "real effect" is not reflected perfectly well
# by the values of surveyQuestion.
surveyImpact <- NA
surveyImpact[surveyQuestion == 1] <- 0
surveyImpact[surveyQuestion == 3] <- 5
surveyImpact[surveyQuestion == 5] <- 6
surveyImpact[surveyQuestion == 7] <- 9
surveyImpact[surveyQuestion == 9] <- 14

y2 <- -5 + 4*surveyImpact + 6*x + 10*e

# Proceed as if nothing were wrong and "throw in" surveyQuestion.
mod3 <- lm(y2 ~ x + surveyQuestion)
summary(mod3)

# Question: are you making a mistake? If you are one of the people
# who says silly things like "my p values are good, so I must have
# the correct model," the results should be very, very sobering.
mod4 <- lm(y2 ~ x + myufac)
summary(mod4)

# or with the ordered factor, which estimates the orthogonal
# polynomials:
mod5 <- lm(y2 ~ x + myofac)
summary(mod5)

anova(mod3, mod4, test = "Chisq")
anova(mod3, mod5, test = "Chisq")

# Again, the happy result that the 2 significance tests are the same.
# Both tell you that you were just flat-out wrong to put
# surveyQuestion into the regression model as a plain numeric variable.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
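P.S. If you want just the p-value from that last comparison, the default F test for lm fits should give you the same verdict here as test="Chisq" (this snippet assumes mod3 and mod4 from the script above):

anova(mod3, mod4)                  # default F test
anova(mod3, mod4)[["Pr(>F)"]][2]   # the p-value by itself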