Re: [R-lang] Contrast Coding in R Regressions

Roger Levy Sat, 19 Sep 2009 20:19:33 -0700


On Sep 15, 2009, at 9:16 PM, Rachel Baker wrote:

Hi,
I've recently started using R to do regressions, using the 'lmer' function. I am currently re-running some analyses that originally had treatment coding, so that they now have contrast coding. My question is about how to interpret contrast coded regression outputs.
One of my independent variables (nativeLanguage) has 3 levels: English, Chinese, and Korean. As this experiment was conducted in English, participants in the English group were native speakers, and participants in the other two groups were non-native speakers. In my original treatment-coded analysis, English was the reference level. My output for e.g. 'langCompare.lmer = lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines like:
                       Estimate  Std. Error  t value
nativeLanguageChinese  0.025920    0.002384   10.872
nativeLanguageKorean  -0.004416    0.002091   -2.112
As I understood it, such lines gave information about the comparison between Chinese and English, and between Korean and English, respectively.
I contrast coded this variable with the code: 'contrasts(myData$nativeLanguage) = c(-1, .5, .5)' (after ordering the levels: English, Chinese, Korean). This was in order to compare the native (English) group to the non-native (Chinese and Korean) groups. After this contrast coding, my output had lines like:
                 Estimate  Std. Error  t value
nativeLanguage1   0.10002    0.010113   11.242
nativeLanguage2  -0.00046    0.639887    1.388
I was wondering how to interpret this output. My guess is that nativeLanguage1 is the comparison between the native and non-native groups, and nativeLanguage2 is the comparison between Chinese and Korean, but I haven't been able to find any resources to confirm this.


Hi Rachel,

Your guess is correct, but the situation may be a little more complicated than you think. First, you need to realize that you didn't specify a complete contrast. Here's a little code snippet to illustrate:


> m <- 20
> n <- 3

> lang <- factor(rep(c("English", "Chinese", "Korean"),m*n),levels=c("English", "Chinese", "Korean"))

> old.contrasts <- contrasts(lang)
> contrasts(lang) <- c(1,-.5,-.5)
> new.contrasts <- contrasts(lang)

Now, let's take a look at the old and new contrast matrices:

> old.contrasts
        Chinese Korean
English       0      0
Chinese       1      0
Korean        0      1
> new.contrasts
        [,1]          [,2]
English  1.0 -5.551115e-17
Chinese -0.5 -7.071068e-01
Korean  -0.5  7.071068e-01

The value of old.contrasts derives from the fact that by default, R uses contr.treatment for unordered factors, with the first level of the factor being the baseline (which for you is English, so that the contrast matrix is all zeroes in the English row):


> options()$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

The value of new.contrasts reflects the fact that -- quoting from ?contrasts -- "If too few [entries for the contrast matrix] are supplied, a suitable contrast matrix is created by extending value after ensuring its columns are contrasts (orthogonal to the constant term) and not collinear." That is why a second column appeared even though only one contrast, c(1,-.5,-.5), was supplied.
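As a minimal sketch of the alternative, you can supply both columns yourself instead of relying on R's automatic completion. The column names nativeVsNonnative and chineseVsKorean, and the particular second column c(0, -0.5, 0.5), are illustrative assumptions rather than part of the original analysis.

```r
lang <- factor(rep(c("English", "Chinese", "Korean"), 4),
               levels = c("English", "Chinese", "Korean"))

# Partial specification: only one column is supplied, R fills in the second.
auto <- lang
contrasts(auto) <- c(1, -0.5, -0.5)
auto.contrasts <- contrasts(auto)

# Full specification: both columns are supplied explicitly
# (hypothetical column names, chosen for readability).
full <- lang
contrasts(full) <- cbind(nativeVsNonnative = c(1, -0.5, -0.5),
                         chineseVsKorean   = c(0, -0.5,  0.5))
full.contrasts <- contrasts(full)

# The completed matrix has columns that sum to zero and are
# orthogonal to each other (off-diagonal entries are zero).
print(round(colSums(auto.contrasts), 10))
print(round(crossprod(auto.contrasts), 10))
print(full.contrasts)
```

With this hand-written second column, the second coefficient would be the Korean mean minus the Chinese mean directly, rather than that difference divided by the square root of two.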

Now let's generate some artificial data and look at how to interpret models fit using the old and new contrast matrices:


> library(lme4)  ## provides lmer() and fixef()
> set.seed(3)
> beta <- c(0,0.26,-0.004)
> speaker <- rep(1:m,3*n)  ## 3 languages x n repetitions for each of m speakers
> b <- rnorm(m,0,0.1)
> y <- beta[lang] + b[speaker] + rnorm(3*m*n)
> contrasts(lang) <- old.contrasts
> print(m.old <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.01403    0.13236 -0.1060
langChinese  0.49860    0.18719  2.6636
langKorean  -0.11447    0.18719 -0.6115
[...]
> contrasts(lang) <- new.contrasts
> print(m.new <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.11402    0.07642   1.492
lang1       -0.12804    0.10807  -1.185
lang2       -0.43350    0.13236  -3.275
[...]

Ignoring speaker-specific effects, the predicted mean for a given language is the intercept plus the dot product of the language's contrast-matrix representation with the coefficients for the language factor. Since the two models are equivalent, their predicted means should be the same for each language. And they are:


> ## compare old contrasts and new contrasts
> ## English: old model
> fixef(m.old)[1] + sum(old.contrasts["English",] * fixef(m.old)[2:3])
(Intercept)
-0.01402749
> ## English: new model
> fixef(m.new)[1] + sum(new.contrasts["English",] * fixef(m.new)[2:3])
(Intercept)
-0.01402749

The same will come out to be the case for the other two languages.
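A minimal sketch of that check for all three languages at once, assuming only the rounded fixed-effect estimates printed in the two summaries above (so agreement is up to rounding, not exact). The names coef.old, coef.new, pred.old and pred.new are introduced here for illustration, and the second contrast column is written as plus or minus sqrt(0.5), matching the printed new.contrasts.

```r
# Fixed-effect estimates copied (rounded) from the two model summaries.
coef.old <- c(-0.01403, 0.49860, -0.11447)   # (Intercept), langChinese, langKorean
coef.new <- c( 0.11402, -0.12804, -0.43350)  # (Intercept), lang1, lang2

langs <- c("English", "Chinese", "Korean")

# Treatment contrasts with English (the first level) as baseline.
old.contrasts <- contr.treatment(langs)

# The extended contrast matrix as printed above.
new.contrasts <- cbind(c(1, -0.5, -0.5),
                       c(0, -sqrt(0.5), sqrt(0.5)))
rownames(new.contrasts) <- langs

# Predicted mean = intercept + (contrast row) . (language coefficients),
# computed for every language by one matrix product.
pred.old <- drop(cbind(1, old.contrasts) %*% coef.old)
pred.new <- drop(cbind(1, new.contrasts) %*% coef.new)

print(round(rbind(old = pred.old, new = pred.new), 4))
```

The two rows should match to about four decimal places; any remaining discrepancy comes from using rounded coefficients instead of fixef(m.old) and fixef(m.new).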

So -- to get back to your question: what do the nativeLanguage1 and nativeLanguage2 coefficients mean in your new model? First, your contrast matrix has columns summing to 0, so the intercept can loosely be thought of as the predicted grand mean. The coefficient for nativeLanguage1 is both (a) the difference between the intercept and the English mean, and (b) twice the difference between the intercept and the average of the Chinese and Korean means; its sign depends on whether English is coded as 1 (as in my snippet) or as -1 (as in your code). The coefficient for nativeLanguage2 is the difference between Chinese and Korean divided by the square root of two. So your guess was basically correct. But it is important to recognize that these two coefficients operate on different scales, as reflected by the fact that the two columns of new.contrasts are vectors of different lengths (Euclidean norms of sqrt(1.5) and 1, respectively).
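These interpretations can be checked numerically. The sketch below is an illustration under stated assumptions, not the analysis from this thread: it uses lm rather than lmer so that it runs in base R (no speaker random effect), a balanced design, and made-up effect sizes. Absolute values are compared for the second coefficient because the sign of the automatically completed column is not something to rely on.

```r
set.seed(1)
lang <- factor(rep(c("English", "Chinese", "Korean"), 30),
               levels = c("English", "Chinese", "Korean"))
# Made-up true means per language plus unit-variance noise.
y <- c(0, 0.5, -0.1)[lang] + rnorm(length(lang))

contrasts(lang) <- c(1, -0.5, -0.5)   # R completes the second column
fit <- lm(y ~ lang)
b <- coef(fit)

mu <- tapply(y, lang, mean)           # observed mean for each language
grand <- mean(mu)                     # grand mean of the three language means
nonnative <- mean(mu[c("Chinese", "Korean")])

# Intercept is the grand mean; lang1 is the English mean minus the grand mean,
# which equals twice (grand mean minus the non-native average).
check.intercept <- all.equal(unname(b["(Intercept)"]), grand)
check.lang1.a <- all.equal(unname(b["lang1"]), unname(mu["English"] - grand))
check.lang1.b <- all.equal(unname(b["lang1"]), 2 * (grand - nonnative))

# lang2 is the Chinese/Korean difference divided by sqrt(2), up to sign.
check.lang2 <- all.equal(abs(unname(b["lang2"])),
                         abs(unname(mu["Korean"] - mu["Chinese"])) / sqrt(2))

# The two contrast columns have different Euclidean lengths.
col.norms <- sqrt(colSums(contrasts(lang)^2))

print(c(check.intercept, check.lang1.a, check.lang1.b, check.lang2))
print(col.norms)
```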

I would greatly appreciate any advice on how to interpret regressions after contrast coding, or pointers to appropriate resources on this topic!

So -- I wish I knew a really good reference on contrast coding. There is some useful information in Chambers & Hastie 1991, Section 2.3.2, and in Venables & Ripley 2002, Section 6.2. I think that Healy 2000 ("Matrices for Statistics") is a useful book that has some pertinent information. But if anyone out there knows a great reference for contrast coding -- I'd love to hear it too!


Best

Roger

--

Roger Levy                      Email: [email protected]
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy







