Dear all: I still need some further help since I think the question itself might be very interesting (i hope so:) : the question is on chisq.test, my concern is which criteria should be used here to evaluate the independence. The reason i use this old subject of the email is, b/c I think the problem here is about how to look at p.value, which evaluate the strength of evidence instead of association. If i am wrong, please correct me.
the result looks like this: index word.comb id in.class0 in.class1 p.value odds.ratio 1 1 TOTAL|LAID 54|241 2 4 0.0004997501 0.00736433 2 2 THEFT|RECOV 52|53 40751 146 0.0004997501 4.17127643 3 3 BOLL|ACCID 10|21 36825 1202 0.0004997501 0.44178546 4 4 LAB|VANDAL 8|55 24192 429 0.0004997501 0.82876099 5 5 VANDAL|CAUS 55|59 801 64 0.0004997501 0.18405918 6 6 AI|TOTAL 9|54 1949 45 0.0034982509 0.63766766 7 7 AI|RECOV 9|53 2385 61 0.0004997501 0.57547012 8 8 THEFT|TOTAL 52|54 33651 110 0.0004997501 4.56174408 the target is to look for best subset of word.comb to differentiate between class0 and class1. p.value is obtained via p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value) and B=20000 (I increased B and it won't help. the margin here is class0=2162792 class1=31859 ) So, in conclusion, which one I should use first, odds.ratio or p.value to find the best subset. I read some on feature selection in text categorization (A comparative study on feature selection in text categorization might be a good ref.). Anyone here has other suggestions? thanks, weiwei On 6/24/05, Gabor Grothendieck <[EMAIL PROTECTED]> wrote: > On 6/24/05, Kjetil Brinchmann Halvorsen <[EMAIL PROTECTED]> wrote: > > Weiwei Shi wrote: > > > > >Hi, > > >I asked this question before, which was hidden in a bunch of > > >questions. I repharse it here and hope I can get some help this time: > > > > > >I have 2 contingency tables which have the same group variable Y. I > > >want to compare the strength of association between X1/Y and X2/Y. I > > >am not sure if comparing p-values IS the way even though the > > >probability of seeing such "weird" observation under H0 defines > > >p-value and it might relate to the strength of association somehow. > > >But I read the following statement from Alan Agresti's "An > > >Introduction to Categorical Data Analysis" : > > >"Chi-squared tests simply indicate the degree of EVIDENCE for an > > >association....It is sensible to decompose chi-squared into > > >components, study residuals, and estimate parameters such as odds > > >ratios that describe the STRENGTH OF ASSOCIATION". > > > > > > > > > > > Here are some things you can do: > > > > > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2)) > > > > > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2)) > > > library(epitools) # on CRAN > > > ?odds.ratio > > Help for 'odds.ratio' is shown in the browser > > > library(help=epitools) # on CRAN > > > tab1 > > [,1] [,2] > > [1,] 11266 2151526 > > [2,] 125 31734 > > > odds.ratio(11266, 125, 2151526, 31734) > > Error in fisher.test(tab) : FEXACT error 40. > > Out of workspace. # so this are evidently for tables > > with smaller counts > > > library(vcd) # on CRAN > > > > > ?oddsratio > > Help for 'oddsratio' is shown in the browser > > > oddsratio( tab1) # really is logodds ratio > > [1] 0.2807548 > > > plot(oddsratio( tab1) ) > > > library(help=vcd) # on CRAN Read this for many nice functions. > > > fourfoldplot(tab1) > > > mosaicplot(tab1) # not really usefull for this table > > > > Also has a look at function Crosstable in package gmodels. > > > > To decompose the chisqure you can program yourselves: > > > > decomp.chi <- function(tab) { > > rows <- rowSums(tab) > > cols <- colSums(tab) > > N <- sum(rows) > > E <- rows %o% cols / N > > contrib <- (tab-E)^2/E > > contrib } > > > > > > > decomp.chi(tab1) > > [,1] [,2] > > [1,] 0.1451026 0.0007570624 > > [2,] 9.8504915 0.0513942218 > > > > > > > So you can easily see what cell contributes most to the overall chisquared. > > > > Kjetil > > > > > > > > > > > > >Can I do this "decomposition" in R for the following example including > > >2 contingency tables? > > > > > > > > > > > >>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2)) > > >>tab1 > > >> > > >> > > > [,1] [,2] > > >[1,] 11266 2151526 > > >[2,] 125 31734 > > > > > > > > > > > >>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2)) > > >>tab2 > > >> > > >> > > > [,1] [,2] > > >[1,] 43571 2119221 > > >[2,] 52 31807 > > > > > > Here are a few more ways of doing this using chisq.test, > glm and assocplot: > > > ## chisq.test ### > > > tab1.chisq <- chisq.test(tab1) > > > # decomposition of chisq > > resid(tab1.chisq)^2 > [,1] [,2] > [1,] 0.1451026 0.0007570624 > [2,] 9.8504915 0.0513942218 > > > # same > > with(tab1.chisq, (observed - expected)^2/expected) > [,1] [,2] > [1,] 0.1451026 0.0007570624 > [2,] 9.8504915 0.0513942218 > > > > # Pearson residuals > > resid(tab1.chisq) > [,1] [,2] > [1,] 0.3809234 -0.02751477 > [2,] -3.1385493 0.22670294 > > > # same > > with(tab1.chisq, (observed - expected)/sqrt(expected)) > [,1] [,2] > [1,] 0.3809234 -0.02751477 > [2,] -3.1385493 0.22670294 > > > > ### glm ### > > # Pearson residuals via glm > > > tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4)) > > tab1.glm <- glm(count ~ ., tab1.df, family = poisson()) > > resid(tab1.glm, type = "pearson") > 1 2 3 4 > 0.38092339 -3.13854927 -0.02751477 0.22670294 > > plot(tab1.glm) > > > ### assocplot ### > > # displaying Pearson residuals via an assocplot > > assocplot(t(tab1)) > -- Weiwei Shi, Ph.D "Did you always know?" "No, I did not. But I believed..." ---Matrix III ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html