I posted this reply to sci.stat.consult only, before noticing that the same question had also been posted separately to sci.stat.edu and sci.stat.math. This copy goes to the latter two.
On 16 Jan 2004 16:31:28 -0800, [EMAIL PROTECTED] (Koh Puay Ping) wrote:

> Hi all, I have a question on logistic regression.
>
> When we are finding a multivariate model (using proc logistic, SAS), I
> understand that we should perform univariate analysis first to
> identify the variables with (Pr < 0.25) for multivariate modelling.

Well. No. Univariate variable-screening for model-building has fundamental problems, unless the project is totally exploratory -- or you have a large surplus of cases. "Stepwise" selection of variables is not a respected technique for most purposes. Pre-selection using the univariate tests eliminates some aspects of confounding, which may be good or bad; but it also means that you cannot use the nominal statistical tests later on, since those tests take no account of the screening step. See my stats-FAQ for comments and references.

[ snip, description of stepwise entry, including an incomplete description of how interactions were incorporated. ]

> I have done this but found that the fit of the model is still not very
> good. But when i remove one of the independent variables that is found
> to be highly significant (Pr < 0.0001, and largest difference of -2 log
> likelihood), the fit of the model improved greatly. So my questions
> are:

That is another mis-statement. You say that when you take out the variable with the largest partial contribution to the "fit", the "fit" improves. That is a contradiction in statistical terms, since the *legitimate* indicator of fit is, precisely, that -2 log likelihood term. So I assume you are picking up some other, less essential criterion of fit, such as "group assignment". Counting correct group assignments is an extra report that regression programs usually provide these days, but it is not essential to the statistical part of the procedure. The deviance and the classification table are different criteria, and they do not have to agree. So, yes, that can happen.
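To make that concrete, here is a small sketch (hypothetical numbers, not from the original question) showing that -2 log likelihood and classification accuracy are separate criteria: model A below has the better (lower) deviance, while model B assigns more cases to the correct group.

```python
# Hypothetical illustration: deviance (-2 log likelihood) vs. "group
# assignment" as criteria of fit for a binary (logistic-type) model.
import math

def deviance(y, p):
    """-2 * log likelihood of observed 0/1 outcomes y given fitted probs p."""
    return -2 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                    for yi, pi in zip(y, p))

def accuracy(y, p, cutoff=0.5):
    """Fraction of cases assigned to the correct group at the given cutoff."""
    return sum((pi > cutoff) == bool(yi) for yi, pi in zip(y, p)) / len(y)

y   = [1, 1, 0, 0]               # observed groups
p_a = [0.90, 0.45, 0.10, 0.55]   # fitted probabilities, model A
p_b = [0.60, 0.60, 0.40, 0.40]   # fitted probabilities, model B

# Model A wins on the legitimate fit criterion (lower deviance) ...
print(deviance(y, p_a), deviance(y, p_b))   # ~3.62 vs ~4.09
# ... while model B wins on group assignment (accuracy 1.00 vs 0.50).
print(accuracy(y, p_a), accuracy(y, p_b))
```

Dropping the strongest likelihood contributor can therefore "improve the fit" only in this weaker classification sense; the deviance itself necessarily gets worse.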
If you were looking at a properly established regression equation and happened to see this when you dropped one variable (for some reason), I would think it very likely indicates that you have some outliers (say) or some other distributional artifact affecting the results. But from what you have said, you don't have a decent equation yet. Funny peripheral things like this are also common once you have over-fitted a data set; since you got to this point by "stepwise", overfitting is another good possibility.

As I said, check my stats-FAQ and those references; or you may want to use keywords in groups.google.com to search the sci.stat.* groups.

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
"Taxes are the price we pay for civilization."
