Q1: Using many independent variables does not *cause* multicollinearity; the number of predictors and the correlations among them are simply different issues. Many predictors do raise the problem of multiple comparisons, although with 50k cases it's hard to see that as a likely issue here. Given the number of cases, this must be observational research with archival data, which can lead to a number of issues if the researcher is trying to make causal claims. Some good background books for addressing such issues would be:
    - Shadish, Cook & Campbell, "Experimental & Quasi-experimental Designs"
    - Rosenbaum, "Observational Studies"
    - Pearl, "Causality"
    - Rothman, et al., "Modern Epidemiology"
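
For students who want to see the multiple-comparisons arithmetic behind Q1, here is a minimal sketch in Python. It assumes that all 22 null hypotheses are true and that the tests are independent, which real regression coefficients will not satisfy exactly, so treat the numbers as a rough illustration only. The function names are made up for this example.

```python
def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive among m independent tests
    when every null hypothesis is true (an idealized assumption)."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(m, alpha=0.05):
    """Expected number of false positives among m tests under the same assumption."""
    return m * alpha


def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold that keeps the familywise error at or below alpha."""
    return alpha / m


m = 22  # number of independent variables mentioned in the question
fwer = familywise_error(m)
expected = expected_false_positives(m)
per_test = bonferroni_alpha(m)

print(f"P(at least one false positive) = {fwer:.3f}")
print(f"Expected false positives       = {expected:.2f}")
print(f"Bonferroni per-test alpha      = {per_test:.5f}")
```

Note that none of these quantities depends on the sample size; they depend only on the number of tests and the threshold used.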

Q2: It is quite reasonable to imagine that variables are often, in reality, related to each other in some way, in which case all a significance test does is assess the power of your study. With 2k cases, I would suspect that the estimate of the association may be fairly accurate (although this could depend on various factors regarding how the data were handled). The question of practical significance is an important one. An easy first step is to look at Kirk's famous paper, Kirk, "Practical significance: A concept whose time has come". A Google Scholar search could tell you about subsequent papers that cited it and give you a start through the related literature.
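
To make the Q2 numbers concrete, the sketch below works through the coefficient quoted in the question (b = -0.18, se = 0.08). It assumes the usual Wald normal approximation for a logistic regression coefficient and a 95% interval; the function name and the returned keys are hypothetical, chosen only for this illustration.

```python
import math


def wald_summary(b, se, z_crit=1.959963984540054):
    """Wald z statistic, two-sided p-value, and 95% interval for a
    logistic regression coefficient, plus the odds-ratio scale.
    Assumes the normal approximation holds."""
    z = b / se
    # Two-sided p-value from the standard normal: p = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    lower, upper = b - z_crit * se, b + z_crit * se
    return {
        "z": z,
        "p": p,
        "ci": (lower, upper),
        "odds_ratio": math.exp(b),
        "odds_ratio_ci": (math.exp(lower), math.exp(upper)),
    }


summary = wald_summary(-0.18, 0.08)
for name, value in summary.items():
    print(name, value)
```

Under these assumptions the interval on the log-odds scale excludes zero, but its bound nearest zero is only about -0.02, and the corresponding odds-ratio interval runs from roughly 0.71 to 0.98. Whether an odds ratio that may be as close to 1 as 0.98 matters in practice is a substantive question that the p-value cannot answer.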

Best,
Jeff


On 3/21/2012 4:59 AM, Iasonas Lamprianou wrote:
Dear all,
I have a question which can be expanded to the general context of regression
modelling. If you feel that this question is beyond the scope of 
this list, please say so and I will apologize. However, this has to do with 
teaching.


Question 1:  I am reviewing a paper and the author uses a sample size of 
around 50,000 cases to run a logistic regression. He is using 22 independent
variables. Using too many independent variables may cause collinearity
problems. Beyond this, however, I am not aware of any other problems caused by
using too many variables in a model. However, this is also related to the 
problem of massively throwing tens of variables in a model and then waiting for 
statistically significant results. Can anyone suggest relevant literature to 
give to my students to read?


Question 2: Some coefficients of a different logistic model in the same paper are 
marginally significant e.g. b=-0.18 and se=0.08. The only reason this is significant is 
because the researcher used in this model a large sample size (around two thousand cases 
N=2000). The lower bound of the confidence interval is almost zero. Can anyone suggest a 
good reference to say that in such a case we should also check the "practical 
significance" and since the lower bound is so close to zero, we should be careful on 
what we claim about the effect?

Thank you for your time
Jason

Dr. Iasonas Lamprianou
Department of Social and Political Sciences
University of Cyprus







_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
