[EMAIL PROTECTED] (Mohammad Ehsanul Karim) wrote in message news:<[EMAIL PROTECTED]>...
> Hello all,
>
> First of all, I am very sorry for my "long" message, and apologies for
> cross-posting.
> I am working on a regression project and would love to know whether
> there are any assumptions of regression analysis beyond the following
> unlucky 13 assumptions.
> Also, please let me know if any of these are inappropriate as general
> regression assumptions.
> And is it possible to provide an example (at least hypothetical, or
> using a software package) for each of these specific assumptions? If
> yes, I would like to know how. Any helpful web address on this topic
> is highly appreciated.
>
> Please send a copy of your kind response to my e-mail address,
> [EMAIL PROTECTED]
>
> -----------------------------------------------------------
> Assumptions that we make in Regression Analysis
> -----------------------------------------------------------
> In most practical regression problems, researchers have in hand only
> sample data drawn from a real or hypothetical population, and based on
> the results derived from the sample data they are supposed to make
> decisions about the population. Therefore, it is important that
> researchers understand the nature of the population they are working
> with. They should know enough about the population to portray a
> suitable model for the particular situation. However, capturing every
> characteristic of a situation may produce a very complex model, and
> sometimes the model does not need to be perfect. But then how can a
> researcher identify the occasions on which the chosen model and data
> are sufficiently compatible to proceed? To ensure this compatibility
> (and to fulfill some other requirements) we set some assumptions.
> The least squares fitting procedure can be used for data analysis as
> a purely descriptive technique. However, the procedure has strong
> theoretical justification if a few assumptions are made about how the
> data are generated. In the context of scientific investigation,
> "assumptions" are not just devices to simplify mathematics, they are
> supposed to be a reasonable mathematical representation of the
> data-generating process.
>
> (a) Assumption 1: Linear Model
>
> The population means of the dependent variable Y at each value of the
> independent variable X are assumed to lie on a straight line. That is,
> the regression model should be linear in the parameters.
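>
> For example, a minimal sketch in Python (numpy only; the simulated
> data and coefficient values are purely illustrative) showing that
> "linear in parameters" still allows curvature in X: a quadratic in X
> is a linear model, because it is linear in b0, b1, b2:
>
>   import numpy as np
>
>   rng = np.random.default_rng(0)
>   x = np.linspace(0, 10, 50)
>   y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(0, 1, 50)  # curved in X
>
>   # Design matrix with columns 1, x, x^2: linear in b0, b1, b2
>   X = np.column_stack([np.ones_like(x), x, x**2])
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   print(b)  # estimates close to (1.0, 2.0, -0.3)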
>
> (b) Assumption 2: Non-random Value of Regressor
>
> Values taken by the Regressor X are considered non-random / fixed in
> repeated sampling. This means our Regression Analysis is conditional
> on the given value of X. In an experiment, the values of the
> independent variable would be fixed by the experimenter and repeated
> samples could be drawn with the independent variables fixed at the
> same values in each sample. As a consequence of this assumption, the
> independent variables will in fact be independent of the disturbance.
>
> (c) Assumption 3: Disturbance term with zero mean
>
> The mean value of the random disturbance term is zero for any given
> value of X; that is, the conditional expected value of the disturbance
> term given X is zero. This assumption assures that although the
> disturbance term makes individual Y values deviate from the fitted
> line, as a whole it has no effect on the mean value of Y, its positive
> values canceling out its negative values.
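>
> A quick numerical illustration in Python (numpy only; the simulated
> data are hypothetical). Note the distinction: the fitted OLS residuals
> always average exactly zero when an intercept is included, by
> construction; the assumption itself concerns the true disturbances:
>
>   import numpy as np
>
>   rng = np.random.default_rng(1)
>   x = rng.uniform(0, 10, 100)
>   e = rng.normal(0, 2, 100)            # true disturbances, mean zero
>   y = 3.0 + 0.5 * x + e
>
>   X = np.column_stack([np.ones_like(x), x])
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   resid = y - X @ b
>   print(e.mean(), resid.mean())        # e near 0; resid 0 up to rounding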
>
> (d) Assumption 4: Disturbance term with constant variance
>
> The conditional variance of the disturbance term given the X values is
> the same / identical / constant. In statistics, we call this
> phenomenon of equal variance HOMOSCEDASTICITY. In contrast, the
> condition of the error variance not being constant over all
> observations is called heteroscedasticity. Heteroscedasticity is
> inherent when the response in a regression analysis follows a
> distribution in which the variance is functionally related to the
> mean, or when there are several groups with different variances.
> The current assumption simply states that as X varies, the
> corresponding Y populations have the same variance around the
> regression line. By imposing this assumption, we emphasize the fact
> that all Y values corresponding to the various X's are equally
> important. If "reliability" is judged by how closely the Y values are
> distributed around their means, then the Y values corresponding to the
> various X's will not be equally reliable if this assumption does not
> hold.
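>
> As an example, a Breusch-Pagan test in Python with statsmodels (the
> simulated heteroscedastic data are my own illustrative choice):
>
>   import numpy as np
>   import statsmodels.api as sm
>   from statsmodels.stats.diagnostic import het_breuschpagan
>
>   rng = np.random.default_rng(2)
>   x = rng.uniform(1, 10, 200)
>   y = 2.0 + 1.0 * x + rng.normal(0, 0.5 * x)  # error spread grows with x
>
>   X = sm.add_constant(x)
>   res = sm.OLS(y, X).fit()
>   lm, lm_pval, f, f_pval = het_breuschpagan(res.resid, X)
>   print(lm_pval)  # small p-value -> evidence of heteroscedasticity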
>
> (e) Assumption 5: No Autocorrelation
>
> For any two given X values, the correlation between the corresponding
> random disturbances is zero, meaning that no autocorrelation should
> exist. By this assumption, we make sure that we consider only the
> systematic effect of X on Y (if any exists) and need not worry about
> influences acting on Y through possible inter-correlations among the
> e's, which would complicate defining the relationship between Y and X.
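>
> For instance, the Durbin-Watson statistic (near 2 when there is no
> first-order autocorrelation) applied to deliberately AR(1)
> disturbances, a hypothetical sketch using numpy and statsmodels:
>
>   import numpy as np
>   import statsmodels.api as sm
>   from statsmodels.stats.stattools import durbin_watson
>
>   rng = np.random.default_rng(3)
>   n = 200
>   x = np.arange(n, dtype=float)
>   e = np.zeros(n)
>   for t in range(1, n):                # AR(1): e_t = 0.8 e_{t-1} + noise
>       e[t] = 0.8 * e[t - 1] + rng.normal()
>   y = 1.0 + 0.1 * x + e
>
>   res = sm.OLS(y, sm.add_constant(x)).fit()
>   print(durbin_watson(res.resid))      # well below 2: positive autocorrelation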
>
> (f) Assumption 6: Stochastic Value of Regressor
>
> If the X's are stochastic, the disturbance term and the random X's
> are independent, or at least uncorrelated. This assumption is
> particularly important for keeping the model tractable. If X and e are
> correlated, it is not possible to isolate and assess their individual
> influences on Y. This is why we assume that X and e have separate
> (additive, in the linear case) effects on Y.
> Notice that this assumption is not necessary if X is fixed
> (Assumption 2). The current assumption is needed only when Assumption
> 2 does not hold.
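>
> A small simulation (numpy only, hypothetical numbers) showing the
> consequence when this assumption fails: if X is correlated with the
> disturbance, the OLS slope is biased:
>
>   import numpy as np
>
>   rng = np.random.default_rng(4)
>   n = 10_000
>   u = rng.normal(size=n)               # disturbance
>   x = u + rng.normal(size=n)           # X correlated with u by construction
>   y = 2.0 + 3.0 * x + u                # true slope is 3
>
>   X = np.column_stack([np.ones(n), x])
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   print(b[1])                          # around 3.5, not 3: biased estimate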
>
> (g) Assumption 7: Estimation Paradox
>
> The number of observations, say n, must be greater than the number of
> parameters to be estimated, which in turn means that the number of
> sample values must be greater than the number of regressors.
>
> (h) Assumption 8: Spread in Regressors
>
> There must be sufficient variability in the values taken by the
> regressors.
> This Assumption 8 and the previous one, Assumption 7, are frequently
> overlooked by textbook authors, although these innocent-looking
> assumptions are very hard to handle when they are violated. Violation
> of Assumption 8 makes it impossible to estimate the regression
> parameters, as the sketch below illustrates.
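>
> A minimal demonstration in Python (numpy only, toy numbers) of how
> both violations show up as a singular X'X matrix:
>
>   import numpy as np
>
>   # Assumption 7 violated: 3 parameters but only 2 observations
>   X_small = np.column_stack([np.ones(2), [1.0, 2.0], [1.0, 4.0]])
>   print(np.linalg.matrix_rank(X_small.T @ X_small))  # 2 < 3: X'X singular
>
>   # Assumption 8 violated: a regressor with no spread duplicates the intercept
>   X_flat = np.column_stack([np.ones(50), np.full(50, 5.0)])
>   print(np.linalg.matrix_rank(X_flat.T @ X_flat))    # 1 < 2: not estimable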
>
> (i) Assumption 9: Model Specification
>
> There should be no specification bias in the model used in the
> empirical analysis; that is, the regression model should be correctly
> specified with respect to
> - Variables
> - Functional form
> - Assumptions
> The validity of the model is also a question of high importance.
> However, in practice the investigator rarely knows the correct
> variables to include in the model, the correct functional form, or the
> correct probabilistic assumptions about the variables entering the
> model. Therefore, some "trial and error" is involved in choosing the
> right model for an empirical situation. Then why do we bother about
> this assumption at all, if judgment is required in selecting a model?
> It is a reminder of the fact that, since our entire regression
> analysis rests on the model, the model should be chosen carefully,
> especially when there are several competing theories.
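>
> One simple diagnostic sketch (numpy only; the quadratic truth is a
> hypothetical example): if a straight line is fitted to a curved
> relationship, the misspecification leaves a systematic pattern in the
> residuals:
>
>   import numpy as np
>
>   rng = np.random.default_rng(5)
>   x = np.linspace(-3, 3, 100)
>   y = 1.0 + x**2 + rng.normal(0, 0.3, 100)   # true relation is quadratic
>
>   X = np.column_stack([np.ones_like(x), x])  # misspecified: line only
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   resid = y - X @ b
>   print(np.corrcoef(resid, x**2)[0, 1])      # strong correlation: omitted x^2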
>
> (j) Assumption 10: Multicollinearity
>
> There should be no exact or perfect linear relationships among the
> independent variables. We assume that the independent variables are
> linearly independent; that is, no independent variable can be
> expressed as a (non-zero) linear combination of the remaining
> independent variables. The failure of this assumption, known as
> multicollinearity, makes it infeasible to disentangle the effects of
> the supposedly independent variables and yields poor estimates of the
> regression coefficients. This assumption is particularly applicable in
> multiple regression with several regressors.
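>
> For example (numpy only; the constructed regressors are hypothetical),
> perfect multicollinearity makes X'X singular, so the normal equations
> have no unique solution:
>
>   import numpy as np
>
>   rng = np.random.default_rng(6)
>   n = 100
>   x1 = rng.normal(size=n)
>   x2 = rng.normal(size=n)
>   x3 = x1 + x2                               # exact linear combination
>
>   X = np.column_stack([np.ones(n), x1, x2, x3])
>   print(np.linalg.matrix_rank(X.T @ X))      # 3 < 4: coefficients not
>                                              # identified; (X'X)^(-1) does
>                                              # not exist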
>
> (k) Assumption 11: Normality assumption
>
> The Classical Normal Linear Regression Model (CNLRM) specifically
> assumes that the stochastic disturbance term (which represents the
> combined influence on the dependent variable of a large number of
> independent variables not explicitly introduced in the regression
> model) is Normally distributed, with mean zero, constant variance, and
> zero covariance between any two disturbance terms. Here we should note
> that the only difference between the Classical Normal Linear
> Regression Model (CNLRM) and the Classical Linear Regression Model
> (CLRM) is that the CLRM requires only that the mean is zero and the
> variance is a finite positive constant; no assumption or restriction
> on the probability distribution of the disturbance is specified. For
> readers wondering why we talk about the Normal distribution in the
> CNLRM, it is worth mentioning that the theoretical justification for
> this extra assumption is the Central Limit Theorem itself.
> Note that normality of the predictors is not required in OLS; that
> is, there are no assumptions regarding the distribution of the
> independent variables in OLS.
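>
> As an illustration, a Jarque-Bera test on the residuals in Python
> (scipy; the simulated data are hypothetical):
>
>   import numpy as np
>   from scipy import stats
>
>   rng = np.random.default_rng(7)
>   x = rng.uniform(0, 10, 300)
>   y = 1.0 + 2.0 * x + rng.normal(0, 1, 300)  # Normal disturbances, per CNLRM
>
>   X = np.column_stack([np.ones_like(x), x])
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   resid = y - X @ b
>   print(stats.jarque_bera(resid))            # no strong evidence against
>                                              # normality expected here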
>
> (l) Assumption 12: Scale of measurement
>
> Variables should be measured at the interval or ratio level. However,
> other statistical techniques can handle possible violations of this
> assumption.
>
> (m) Assumption 13: Error free predictor
>
> One of the assumptions of linear regression is that the independent
> variable is measured without error. Just a philosophical question:
> is it possible to measure anything without error? There is a large
> literature on errors-in-variables regression, which we will skip here.
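>
> That literature can at least be motivated with a short simulation
> (numpy only, hypothetical numbers) of the classic consequence,
> attenuation bias: measurement error in X pulls the slope toward zero:
>
>   import numpy as np
>
>   rng = np.random.default_rng(8)
>   n = 100_000
>   x_true = rng.normal(0, 1, n)
>   y = 2.0 * x_true + rng.normal(0, 1, n)     # true slope is 2
>   x_obs = x_true + rng.normal(0, 1, n)       # X observed with error
>
>   X = np.column_stack([np.ones(n), x_obs])
>   b, *_ = np.linalg.lstsq(X, y, rcond=None)
>   print(b[1])                                # about 1.0 = 2 * var(x)/(var(x)+1)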
>
Assumption 3 has to be strengthened by asserting that the mean of the
residuals is zero everywhere, or at least not statistically different
from zero everywhere (i.e., over all contiguous subsets). A series
that has a level shift can have a model with a zero mean for all the
residuals overall, yet with two offsetting local means:

  ......        (level 2)
        ......  (level 1)

Overall, the model y = 1.5 + a(t) generates a zero mean for the
residuals, but not LOCALLY.
Not discussed BY YOU is the assumption that the parameters are
constant over the range of observation. Gregory Chow's work on testing
the hypothesis of a constant set of parameters underpins some of the
work that we do in AUTOBOX to test for break points.
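
As a rough illustration (not AUTOBOX's actual procedure; the simulated
series and the known split point are my own hypothetical choices), a
minimal Chow test in Python:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(9)
  n = 100
  t = np.arange(n, dtype=float)
  y = np.where(t < 50, 1.0, 2.0) + rng.normal(0, 0.2, n)  # level shift at t=50

  def rss(tt, yy):                      # residual sum of squares of a line fit
      X = np.column_stack([np.ones_like(tt), tt])
      b, *_ = np.linalg.lstsq(X, yy, rcond=None)
      r = yy - X @ b
      return r @ r

  k = 2                                 # parameters per regime
  rss_pooled = rss(t, y)
  rss_split = rss(t[:50], y[:50]) + rss(t[50:], y[50:])
  F = ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))
  print(F, 1 - stats.f.cdf(F, k, n - 2 * k))  # huge F, tiny p: break detected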
http://www.autobox.com
regards
dave reilly
AFS
>
> Thanks for your patience,
>
>
>
> _______________________
>
> Mohammad Ehsanul Karim <[EMAIL PROTECTED]>
> Institute of Statistical Research and Training
> University of Dhaka, Dhaka- 1000, Bangladesh
> _______________________