On Sep 15, 2009, at 9:16 PM, Rachel Baker wrote:

Hi,

I've recently started using R to do regressions, using the 'lmer' function. I am currently re-running some analyses that originally had treatment coding, so that they now have contrast coding. My question is about how to interpret contrast coded regression outputs.

One of my independent variables (nativeLanguage) has 3 levels: English, Chinese, and Korean. As this experiment was conducted in English, participants in the English group were native speakers, and participants in the other two groups were non-native speakers. In my original treatment-coded analysis, English was the reference level. My output for e.g. 'langCompare.lmer = lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines like:

                       Estimate Std. Error t value
nativeLanguageChinese  0.025920   0.002384  10.872
nativeLanguageKorean  -0.004416   0.002091  -2.112

As I understood it, such lines gave information about the comparison between Chinese and English, and between Korean and English, respectively.

I contrast coded this variable with the code: 'contrasts(myData$nativeLanguage) = c(-1, .5, .5)' (after ordering the levels: English, Chinese, Korean). This was in order to compare the native (English) group to the non-native (Chinese and Korean) groups. After this contrast coding, my output had lines like:

                Estimate Std. Error t value
nativeLanguage1  0.10002   0.010113  11.242
nativeLanguage2 -0.00046   0.639887   1.388

I was wondering how to interpret this output. My guess is that nativeLanguage1 is the comparison between the native and non-native groups, and nativeLanguage2 is the comparison between Chinese and Korean, but I haven't been able to find any resources to confirm this.

Hi Rachel,

Your guess is correct, but the situation may be a little more complicated than you think. First, you need to realize that you didn't specify a complete contrast matrix: a three-level factor needs two contrast columns, and you supplied only one. Here's a little code snippet to illustrate:

> m <- 20
> n <- 3
> lang <- factor(rep(c("English", "Chinese", "Korean"),m*n), levels=c("English", "Chinese", "Korean"))
> old.contrasts <- contrasts(lang)
> contrasts(lang) <- c(1,-.5,-.5)
> new.contrasts <- contrasts(lang)

Now, let's take a look at the old and new contrast matrices:

> old.contrasts
        Chinese Korean
English       0      0
Chinese       1      0
Korean        0      1
> new.contrasts
        [,1]          [,2]
English  1.0 -5.551115e-17
Chinese -0.5 -7.071068e-01
Korean  -0.5  7.071068e-01

The value of old.contrasts derives from the fact that by default, R uses contr.treatment for unordered factors, with the first level of the factor being the baseline (which for you is English, so that the contrast matrix is all zeroes in the English row):

> options()$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

The value of new.contrasts reflects the fact that -- quoting from ?contrasts -- "If too few [entries for the contrast matrix] are supplied, a suitable contrast matrix is created by extending value after ensuring its columns are contrasts (orthogonal to the constant term) and not collinear."
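To see this completion at work without fitting anything, here is a small sketch in base R. The object names f, C, and C.full are mine, chosen only for illustration. It checks that the matrix R builds from the single supplied column has two columns, that both sum to zero, and that they are orthogonal to each other. It then shows one way to supply both columns yourself, so that nothing is left for R to fill in:

```r
# Three-level factor with English as the first level
f <- factor(c("English", "Chinese", "Korean"),
            levels = c("English", "Chinese", "Korean"))

# Supply only one contrast column; R fills in the second
contrasts(f) <- c(1, -0.5, -0.5)
C <- contrasts(f)

dim(C)                   # 3 rows (levels) by 2 columns (contrasts)
round(colSums(C), 10)    # each column sums to zero
round(crossprod(C), 10)  # zero off-diagonal: the columns are orthogonal

# Alternative: give the full 3 x 2 matrix explicitly
# (the column names are illustrative labels, not anything R requires)
C.full <- cbind(native.vs.nonnative = c(1, -0.5, -0.5),
                Korean.vs.Chinese   = c(0, -0.5,  0.5))
contrasts(f) <- C.full
contrasts(f)
```

Supplying the full matrix has the advantage that you decide the sign and scale of the second comparison rather than inheriting whatever the completion produces.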

Now let's generate some artificial data and look at how to interpret models fit using the old and new contrast matrices:

> library(lme4)  # provides lmer() and fixef()
> set.seed(3)
> beta <- c(0,0.26,-0.004)
> speaker <- rep(1:m,3*n)
> b <- rnorm(m,0,0.1)
> y <- beta[lang] + b[speaker] + rnorm(3*m*n)
> contrasts(lang) <- old.contrasts
> print(m.old <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.01403    0.13236 -0.1060
langChinese  0.49860    0.18719  2.6636
langKorean  -0.11447    0.18719 -0.6115
[...]
> contrasts(lang) <- new.contrasts
> print(m.new <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.11402    0.07642   1.492
lang1       -0.12804    0.10807  -1.185
lang2       -0.43350    0.13236  -3.275
[...]

Ignoring speaker-specific effects, the predicted mean for a given language is the intercept plus the dot product of the language's contrast-matrix representation with the coefficients for the language factor. Since the two models are equivalent, their predicted means should be the same for each language. And they are:

> ## compare old contrasts and new contrasts
> ## English: old model
> fixef(m.old)[1] + sum(old.contrasts["English",] * fixef(m.old)[2:3])
(Intercept)
-0.01402749
> ## English: new model
> fixef(m.new)[1] + sum(new.contrasts["English",] * fixef(m.new)[2:3])
(Intercept)
-0.01402749

The same holds for the other two languages.
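For all three languages at once, the same computation can be written as a matrix product. The sketch below does not refit the models; it simply plugs in the rounded fixed-effect estimates printed above, so the two sets of means agree only up to rounding. The names C.old, C.new, b.old, b.new, and cell.means are made up for this illustration, and the second column of C.new uses an exact 0 where R printed -5.551115e-17:

```r
langs <- c("English", "Chinese", "Korean")

# Contrast matrices: treatment coding, and the completed c(1, -.5, -.5) coding
C.old <- contr.treatment(3)
C.new <- cbind(c(1, -0.5, -0.5), c(0, -1, 1) / sqrt(2))
rownames(C.old) <- rownames(C.new) <- langs

# Rounded fixed effects copied from the two model summaries
b.old <- c(-0.01403,  0.49860, -0.11447)
b.new <- c( 0.11402, -0.12804, -0.43350)

# Predicted mean = intercept + (row of contrast matrix) %*% (slopes)
cell.means <- function(b, C) drop(b[1] + C %*% b[-1])

means.old <- cell.means(b.old, C.old)
means.new <- cell.means(b.new, C.new)
round(rbind(old = means.old, new = means.new), 4)
```

With fitted models in hand you would pass fixef(m.old) and fixef(m.new) instead of the hand-copied vectors.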

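So -- to get back to your question: what do the nativeLanguage1 and nativeLanguage2 coefficients mean in your new model? First, your contrast matrix has columns summing to 0, so the intercept is the unweighted average of the three predicted language means, which can loosely be thought of as the predicted grand mean. With the coding c(1,-.5,-.5) used in my example, the coefficient for nativeLanguage1 can be read in two equivalent ways: (a) it is the English mean minus the intercept, and (b) it is twice the difference between the intercept and the average of the Chinese and Korean means. Put another way, it is two-thirds of the difference between the English mean and the average of the Chinese and Korean means. With your coding c(-1,.5,.5) the sign is simply reversed. The coefficient for nativeLanguage2 is the difference between the Korean and Chinese means divided by the square root of two; its sign depends on how R completed the second column, so check contrasts(myData$nativeLanguage). So your guess was basically correct. But it is important to recognize that these two coefficients are on different scales, as reflected by the fact that the two columns of new.contrasts have different Euclidean lengths (sqrt(1.5) for the first, 1 for the second).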
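To make these relationships concrete, the coefficients can be recovered from the three language means by inverting the coding: put a column of ones next to the contrast matrix and solve. The sketch below uses made-up means (mu is hypothetical, not an estimate from anyone's data) and the completed c(1,-.5,-.5) coding from my example. The last part shows an alternative coding, which is only a suggestion on my part and not something R does by default, whose coefficients are the unscaled differences:

```r
# Hypothetical language means, for illustration only
mu <- c(English = 1.0, Chinese = 1.6, Korean = 1.3)

# Completed coding from the example above
C <- cbind(c(1, -0.5, -0.5), c(0, -1, 1) / sqrt(2))

# Means = cbind(1, C) %*% coefficients, so solve for the coefficients
b <- solve(cbind(1, C), mu)

nonnative <- mean(mu[c("Chinese", "Korean")])

# Each of these differences should be zero up to floating-point error
b[1] - mean(mu)                                      # intercept vs. mean of the means
b[2] - (mu[["English"]] - b[1])                      # English mean minus intercept
b[2] - (2 / 3) * (mu[["English"]] - nonnative)       # 2/3 of native minus non-native
b[3] - (mu[["Korean"]] - mu[["Chinese"]]) / sqrt(2)  # Korean minus Chinese, over sqrt(2)

# Euclidean lengths of the two columns differ, hence the different scales
sqrt(colSums(C^2))

# Alternative coding whose coefficients are the raw differences
C.diff <- cbind(c(2, -1, -1) / 3, c(0, -1, 1) / 2)
b.diff <- solve(cbind(1, C.diff), mu)
b.diff[2] - (mu[["English"]] - nonnative)            # should be zero
b.diff[3] - (mu[["Korean"]] - mu[["Chinese"]])       # should be zero
```

The two codings describe the same model and give the same predicted means; only the units in which the two comparisons are reported change.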

I would greatly appreciate any advice on how to interpret regressions after contrast coding, or pointers to appropriate resources on this topic!

So -- I wish I knew a really good reference on contrast coding. There is some useful information in Chambers & Hastie 1991, Section 2.3.2, and in Venables & Ripley 2002, Section 6.2. I think that Healy 2000 ("Matrices for Statistics") is a useful book that has some pertinent information. But if anyone out there knows a great reference for contrast coding -- I'd love to hear it too!

Best

Roger

--

Roger Levy                      Email: [email protected]
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy

_______________________________________________
R-lang mailing list
[email protected]
http://pidgin.ucsd.edu/mailman/listinfo/r-lang
