Hello all,
First of all, I am sorry for the long message, and apologies for
cross-posting.
I am working on a regression project and would like to know whether
there are any assumptions of regression analysis beyond the
following "unlucky 13". Please also let me know if any of these is
inappropriate as a general assumption of regression.
Also, is it possible to give an example (hypothetical, or using any
software package) for each of these specific assumptions? If so, I
would like to know how. Any helpful web address on this topic would
be much appreciated.
Please send a copy of your kind response to my e-mail address at
[EMAIL PROTECTED]
-----------------------------------------------------------
Assumptions that we make in Regression Analysis
-----------------------------------------------------------
In most practical regression problems, researchers have only sample
data, drawn from a real or hypothetical population, and on the basis
of results derived from that sample they must make decisions about
the population. It is therefore important that researchers understand
the nature of the population they are working with. They should know
enough about the population to build a suitable model for the
particular situation. However, capturing every characteristic of a
situation may produce an overly complex model; sometimes the model
does not need to be perfect. But then how can a researcher identify
the occasions on which the chosen model and the data are sufficiently
compatible to proceed? To ensure this compatibility (and to fulfill
some other requirements) we make certain assumptions.
The least squares fitting procedure can be used for data analysis as
a purely descriptive technique. However, the procedure has strong
theoretical justification if a few assumptions are made about how the
data are generated. In the context of scientific investigation,
"assumptions" are not just devices to simplify the mathematics; they
are meant to be a reasonable mathematical representation of the
data-generating process.
(a) Assumption 1: Linear Model
The population means of the dependent variable Y at each value of the
independent variable X are assumed to lie on a straight line. That
is, the regression model should be linear in the parameters.
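As a hypothetical illustration (since examples were requested above), here is a small Python sketch, assuming numpy is available; the model and numbers are made up. It shows that a quadratic curve, although nonlinear in X, is still linear in the parameters and can be fitted by ordinary least squares:

```python
import numpy as np

# A quadratic curve is nonlinear in X but linear in the parameters
# b0, b1, b2, so ordinary least squares still applies.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 1, size=x.size)

# Design matrix with columns 1, x, x^2: the model is a linear
# combination of the parameters.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates of (b0, b1, b2), close to (2.0, 1.5, -0.3)
```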
(b) Assumption 2: Non-random Value of Regressor
Values taken by the regressor X are considered non-random (fixed in
repeated sampling). This means our regression analysis is conditional
on the given values of X. In an experiment, the values of the
independent variable would be fixed by the experimenter, and repeated
samples could be drawn with the independent variables fixed at the
same values in each sample. As a consequence of this assumption, the
independent variables are in fact independent of the disturbance.
(c) Assumption 3: Disturbance term with zero mean
The mean value of the random disturbance term is zero for any given
value of X; that is, the conditional expected value of the
disturbance term given X is zero. This assumption ensures that
although the disturbance term makes individual Y values deviate from
the fitted line, as a whole (when summed) it has no effect on the
mean value of Y: its positive values are canceled by its negative
values.
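A simulated Python sketch of the sample analogue of this assumption (numpy assumed, numbers made up): when an intercept is included, least squares forces the fitted residuals to average exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 2, size=x.size)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
# With an intercept in the model, the residuals average exactly
# zero (the sample analogue of E[e | X] = 0): positive deviations
# cancel negative ones.
print(resid.mean())  # essentially 0 up to floating-point error
```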
(d) Assumption 4: Disturbance term with constant variance
The conditional variance of the disturbance term, given the X values,
is constant. In statistics this condition of equal variance is called
HOMOSCEDASTICITY; in contrast, the condition of the error variance
not being constant over all observations is called
heteroscedasticity. Heteroscedasticity is inherent when the response
in a regression analysis follows a distribution in which the variance
is functionally related to the mean, or when there are several groups
with different variances. The current assumption simply states that
as X varies, the corresponding Y populations have the same variance
around the regression line. By imposing this assumption, we emphasize
that all Y values corresponding to the various X's are equally
important. If "reliability" is judged by how closely the Y values are
distributed around their means, then the Y values corresponding to
the various X's will not be equally reliable when this assumption
does not hold.
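A hypothetical Python simulation of heteroscedasticity (numpy assumed): the error spread is made to grow with X, and a crude split-sample check on the residuals reveals it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
# Heteroscedastic disturbances: the error standard deviation
# grows with x instead of staying constant.
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
# Crude check: compare the residual spread in the lower and upper
# halves of the x range.
lo = resid[x < 5.5].std()
hi = resid[x >= 5.5].std()
print(lo, hi)  # hi is noticeably larger than lo
```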
(e) Assumption 5: No Autocorrelation
For any two given X values, the correlation between the corresponding
random disturbances is zero; that is, there is no autocorrelation. By
this assumption we ensure that we consider only the systematic effect
of X on Y (if any exists), without having to worry about influences
on Y arising from possible inter-correlations among the e's, which
would complicate the task of defining the relationship between Y
and X.
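As a made-up Python illustration (numpy assumed), here the disturbances are simulated from an AR(1) process, and the Durbin-Watson statistic (computed directly on the simulated errors for simplicity) falls well below 2, signaling positive autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# AR(1) disturbances: each error carries over 0.8 of the previous
# one, violating the no-autocorrelation assumption.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal(0, 1)

# Durbin-Watson statistic: near 2 for independent errors,
# well below 2 under positive autocorrelation
# (roughly 2 * (1 - 0.8) = 0.4 here).
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)
```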
(f) Assumption 6: Stochastic Value of Regressor
If the X's are stochastic, the disturbance term and the random X's
are independent, or at least uncorrelated. This assumption keeps the
model tractable: if X and e are correlated, it is not possible to
isolate and assess their individual influences on Y. This is why we
assume that X and e have separate (additive, in the linear case)
effects on Y.
Notice that this assumption is not necessary if X is fixed
(Assumption 2); it is needed only when Assumption 2 does not apply.
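A hypothetical Python simulation of what goes wrong when this assumption fails (numpy assumed, numbers made up): a stochastic X is constructed to be correlated with the disturbance, and the least-squares slope is then systematically biased away from its true value.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
u = rng.normal(0, 1, n)
# X is stochastic AND correlated with the disturbance u,
# violating the assumption.
x = rng.normal(0, 1, n) + 0.8 * u
y = 1.0 + 2.0 * x + u  # true slope is 2.0

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# The effects of x and u cannot be separated: the estimated slope
# absorbs part of u's influence and is biased upward.
print(b[1])  # noticeably above the true value 2.0
```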
(g) Assumption 7: Estimation Paradox
The number of observations, say n, must be greater than the number of
parameters to be estimated; in other words, the number of sample
values must exceed the number of regressors.
(h) Assumption 8: Spread in Regressors
There must be sufficient variability in the values taken by the
regressors.
This assumption and the previous one (Assumption 7) are frequently
overlooked by textbook authors, although these innocent-looking
assumptions are very hard to work around when they are violated.
Violation of Assumption 8 makes it impossible to estimate the
regression parameters at all.
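A minimal Python sketch of why Assumption 8 matters (numpy assumed): if the regressor never varies, the design matrix is rank-deficient and the intercept and slope cannot be separated.

```python
import numpy as np

# If the regressor takes a single value in every observation,
# the columns of the design matrix are proportional and the
# slope cannot be estimated.
x = np.full(10, 4.0)  # no spread at all
X = np.column_stack([np.ones_like(x), x])
rank = np.linalg.matrix_rank(X)
print(rank)  # 1, not 2: intercept and slope are not separable
```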
(i) Assumption 9: Model Specification
There should be no specification bias in the model used in empirical
analysis; that is, the regression model should be correctly specified
with respect to
- the variables included
- the functional form
- the probabilistic assumptions
The validity of the model is also a question of great importance.
In practice, however, the investigator rarely knows the correct
variables to include in the model, the correct functional form, or
the correct probabilistic assumptions about the variables entering
the model. Therefore, some trial and error is involved in choosing
the right model for an empirical situation. Why, then, bother with
this assumption at all if judgment is required in selecting a model?
It is simply a reminder that, since the entire regression analysis
rests on the model, great care should go into formulating it,
especially when there are several competing theories.
(j) Assumption 10: Multicollinearity
There should be no exact (perfect) linear relationships among the
independent variables. We assume that the independent variables are
linearly independent; that is, no independent variable can be
expressed as a (non-zero) linear combination of the remaining
independent variables. The failure of this assumption, known as
multicollinearity, makes it infeasible to disentangle the effects of
the supposedly independent variables and yields poor estimates of the
regression coefficients. This assumption is particularly relevant in
multiple regression, where there are several regressors.
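A made-up Python example of perfect multicollinearity (numpy assumed): one regressor is an exact multiple of another, so the design matrix loses rank and the usual least-squares solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
x2 = 3.0 * x1  # perfectly collinear with x1
X = np.column_stack([np.ones(50), x1, x2])
# The cross-product matrix X'X is singular, so the usual OLS
# formula (X'X)^(-1) X'y cannot be applied: the separate effects
# of x1 and x2 cannot be disentangled.
print(np.linalg.matrix_rank(X))  # 2, not 3
```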
(k) Assumption 11: Normality assumption
The Classical Normal Linear Regression Model (CNLRM) specifically
assumes that the stochastic disturbance term (which represents the
combined influence on the dependent variable of a large number of
independent variables that are not explicitly introduced into the
regression model) is normally distributed, with mean zero, constant
variance, and zero covariance between any two disturbance terms.
Note that the only difference between the CNLRM and the Classical
Linear Regression Model (CLRM) is that the CLRM requires only that
the mean be zero and the variance a finite positive constant; no
assumption about the probability distribution of the disturbance is
specified. For readers wondering why the normal distribution appears
in the CNLRM, it is worth mentioning that the theoretical
justification for this extra assumption is the Central Limit theorem
itself.
Note that normality of the predictors is not required in OLS; that
is, there are no assumptions regarding the distribution of the
independent variables in OLS.
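A hypothetical Python sketch of the Central Limit theorem rationale (numpy assumed): each disturbance is built as the sum of many small independent shocks, and the resulting distribution is approximately normal, e.g. about 95% of standardized values fall within two standard deviations.

```python
import numpy as np

rng = np.random.default_rng(7)
# Each disturbance is the sum of 50 small independent shocks
# (here uniform on [-1, 1]); by the Central Limit theorem the
# sum is approximately normally distributed.
e = rng.uniform(-1, 1, size=(10000, 50)).sum(axis=1)
e_std = (e - e.mean()) / e.std()
# Crude normality check: for a normal distribution, about 95%
# of standardized values lie within +/- 2.
frac = np.mean(np.abs(e_std) < 2)
print(frac)
```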
(l) Assumption 12: Scale of measurement
The variables are measured at the interval or ratio level. However,
other statistical techniques can address possible violations of this
assumption.
(m) Assumption 13: Error free predictor
One of the assumptions of linear regression is that the independent
variable is measured without error. (A philosophical question: is it
possible to measure anything without error?) There is, however, a
large literature on errors-in-variables regression, which we will
skip here.
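As a final hypothetical Python simulation (numpy assumed, numbers made up), here is the classic consequence of measurement error in X, attenuation bias: the estimated slope shrinks toward zero by the factor var(X) / (var(X) + var(error)).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
x_true = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x_true + rng.normal(0, 1, n)  # true slope 2.0

# We observe X contaminated with measurement error of variance 1.
x_obs = x_true + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x_obs])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# Attenuation: slope shrinks by var(x)/(var(x)+var(err)) = 1/2,
# so the estimate lands near 1.0 instead of 2.0.
print(b[1])
```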
-----------------------------------------------------------
Thanks for your patience,
_______________________
Mohammad Ehsanul Karim <[EMAIL PROTECTED]>
Institute of Statistical Research and Training
University of Dhaka, Dhaka- 1000, Bangladesh
_______________________
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
http://jse.stat.ncsu.edu/
=================================================================