Hi all,

I am having trouble deriving maximum likelihood estimators for parameters
in a very simple (?) linear model. Presumably, I am missing some very
fundamental principle of (ML-)estimation theory (I think with respect to
some relation between degrees of freedom and ML-estimation), which I would
like to have pointed out to me... I've consulted many different books, but
I'm just unable to localize the pain. Since I'm not a mathematician, merely
a psychologist applying statistical methods, I hope you can forgive me for
any mistakes and lack of rigor.

Suppose I have the following model:

y_{i} = X b_{i} + e_{i},          (i = 1,...,n)                          (1)

where y_{i} is an m-vector of measured responses, X is an m \times p (e.g.
design) matrix of full column rank, b_{i} is a p-vector, and the e_{i} ~
N(0,Sigma) are independent and identically distributed random vectors
(Sigma is assumed to be nonsingular). Both b_{i}, i = 1,...,n and Sigma are
unknown and have to be estimated. For convenience let us put B =
[b_{1},...,b_{n}], so B is p \times n. This is actually very much like the
factor-analytic linear model, since the b_{i}'s are rather like stochastic
variables, but I would like to think of them as mathematical parameters
(although it might be hard to make a distinction, see the remark after
question 7 at the bottom). In this model the number of parameters is p*n
for B and m*(m+1)/2 for Sigma, and the number of observed data points is
n*m, so that n*m > p*n + m*(m+1)/2 <=> n > m*(m+1)/[2(m-p)] (assuming
m > p; e.g. with m = 5 and p = 2 this requires n > 5); we will assume that
this is indeed the case. Assuming
that the b_{i}'s may indeed be interpreted as mathematical variables, the
likelihood function for this model, given the observations y_{1},...,y_{n}
is proportional to

L(B,Sigma) =  |Sigma|^{-n/2} exp{- Q(Sigma,B)/2}                         (2)

where
              n
Q(Sigma,B) = SUM [y_{i} - X b_{i}]'Sigma^{-1} [y_{i} - X b_{i}]
             i=1

where prime denotes transposition; or equivalently

Q(Sigma,B) = tr{Sigma^{-1}V(B)},

with

        n
V(B) = SUM [y_{i} - X b_{i}][y_{i} - X b_{i}]'
       i=1
     = [Y - XB][Y - XB]'                                                 (3)

where Y = [y_{1}...y_{n}] is the m \times n matrix with responses.
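
A quick numerical check of this trace identity, in Python with numpy (the
sizes m, p, n are arbitrary choices, just for illustration):

import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 2, 20                           # arbitrary sizes with m > p
X = rng.standard_normal((m, p))              # full column rank (a.s.)
B = rng.standard_normal((p, n))
Sigma = np.eye(m) + 0.5 * np.ones((m, m))    # some nonsingular covariance
Y = X @ B + rng.multivariate_normal(np.zeros(m), Sigma, size=n).T  # model (1)

R = Y - X @ B                                # residuals y_i - X b_i (columns)
V = R @ R.T                                  # V(B) of equation (3)
Si = np.linalg.inv(Sigma)
q_sum = sum(R[:, i] @ Si @ R[:, i] for i in range(n))
print(np.isclose(q_sum, np.trace(Si @ V)))   # True: the two forms of Q agree
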
Taking derivatives of L with respect to (the elements of) Sigma and
equating these to zero results in the maximum likelihood estimate for
Sigma:

est(Sigma) = n^{-1} V(B)

Substitution of this estimator into L results in the well known concentrated
likelihood:

M(B) = L(B,V(B)/n)
     = n^{nm/2} |V(B)|^{-n/2} exp{-nm/2}
     = k |V(B)|^{-n/2}

with k = n^{nm/2}exp{-nm/2} (note that |V(B)/n| = n^{-m}|V(B)|). Now the
ML-estimate for B can be found by solving dM/dB = 0. This leads to the
implicit equation for B

B = [X'V(B)^{-1}X]^{-1}X'V(B)^{-1}Y                                      (4)

I wonder if this is a standard result as well, and if so I would like to
have some reference, or an explanation of how to deal with these sorts of
equations.
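
For what it's worth, the obvious numerical attack on an implicit equation
like (4) is fixed-point iteration: start somewhere, compute V(B), update B,
and repeat. A minimal sketch in Python/numpy (simulated data, arbitrary
sizes, just for illustration) -- note that it already stalls after a single
step:

import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 2, 20                        # arbitrary sizes
X = rng.standard_normal((m, p))
Y = X @ rng.standard_normal((p, n)) + rng.standard_normal((m, n))

B_hat = np.zeros((p, n))                  # start where V(B) = YY' is nonsingular
R = Y - X @ B_hat
Vi = np.linalg.inv(R @ R.T)               # fine: rank(YY') = m generically
B_hat = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)   # one update of (4)

V1 = (Y - X @ B_hat) @ (Y - X @ B_hat).T  # V(B) after the update
print(np.linalg.matrix_rank(V1), m)       # prints m - p and m: V is singular,
                                          # so the next update cannot be made

The reason it stalls: after any update of this GLS form the residuals
Y - XB lie in an (m - p)-dimensional subspace (the update annihilates the
column-space of X), so V(B) cannot remain nonsingular past the first step.
This is the same degeneracy I run into below.
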
However, since

M(B) = k |V(B)|^{-n/2}
     = k |[Y - XB][Y - XB]'|^{-n/2}                                      (5)

it seems obvious that some matrix norm of [Y - XB] should be minimized with
respect to B (but I'm not so sure of this because it would imply that the
ML-estimates of B are equal to the OLS estimates in the case of multivariate
normal errors). We could take est(B) = X^{+}Y, where X^{+} is the
Moore-Penrose inverse; this estimate results from minimizing the squared
Frobenius-norm tr{[Y - XB]'[Y - XB]} and is the usual OLS estimate for the
model in equation (1). (Here X^{+} = (X'X)^{-1}X', since X has full column
rank.) If we substitute this estimate into equation (5), the (concentrated)
likelihood explodes, for if we take est(B) = X^{+}Y

M(est(B)) = k |[Y - XX^{+}Y][Y - XX^{+}Y]'|^{-n/2}
          = k |[I - P(X)]YY'[I - P(X)]'|^{-n/2}                          (6)

where P(X) is the orthogonal projector onto the column-space of X, so the
matrix [I - P(X)]YY'[I - P(X)]' has rank at most m - p < m and is therefore
singular. Thus we have found a matrix est(B) which (wittingly) makes the
estimated errors 'more than likely': the likelihood density becomes
infinite! I think this should also hold if we take est(B) to be the
right-hand side of equation (4), which is the maximum likelihood estimate
of B, since equation (6) then becomes

M(est(B)) = k | [I - P(X,V(B))]YY'[I - P(X,V(B))]' |^{-n/2}              (7)

where P(X,V(B)) is the orthogonal projector onto the column-space of X with
respect to the inner product <a,b> = a'V(B)^{-1}b. But again, I'm not sure
of this, since V(B) depends on B; maybe someone can confirm or disconfirm
this and explain to me why.
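
Numerically the explosion at the OLS estimate is easy to exhibit; a small
check in Python/numpy (simulated data, arbitrary sizes) that the
determinant in (6) vanishes:

import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 2, 20                        # arbitrary sizes
X = rng.standard_normal((m, p))
Y = X @ rng.standard_normal((p, n)) + rng.standard_normal((m, n))

P = X @ np.linalg.solve(X.T @ X, X.T)     # orthogonal projector P(X)
W = (np.eye(m) - P) @ Y                   # OLS residuals [I - P(X)]Y
print(np.linalg.det(W @ W.T))             # ~0 (the matrix has rank m - p),
                                          # so |.|^{-n/2} in (6) is unbounded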

You might wonder why I want a maximum likelihood estimate if I already
have the OLS estimate, although the reason may also be obvious: I want to
find confidence regions for the estimated parameters. Normally, for the
linear model this could be achieved using the dispersion matrix of the
estimated parameters, which turns out to be

var(b_{i}) = (X'Sigma^{-1}X)^{-1}          ( i = 1,...,n )               (8)

however, since I don't know Sigma, it has to be estimated from the data,
using for instance the ML-estimate in the equation given after equation (3):

est(Sigma) = n^{-1} V(B)
           = n^{-1} [Y - XB][Y - XB]'

which, as we concluded above, becomes singular if we substitute the OLS
(and ML-?) estimate est(B) for B.
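
As an aside, equation (8) itself is easy to verify by simulation when Sigma
IS known -- a quick Monte Carlo sketch in Python/numpy (arbitrary sizes;
the true b is set to zero, which is no loss of generality since (8) does
not depend on it):

import numpy as np

rng = np.random.default_rng(0)
m, p = 5, 2                                   # arbitrary sizes
X = rng.standard_normal((m, p))
Sigma = np.eye(m) + 0.5 * np.ones((m, m))     # some nonsingular covariance
Si = np.linalg.inv(Sigma)
G = np.linalg.solve(X.T @ Si @ X, X.T @ Si)   # GLS map: est(b) = G y
ests = np.array([G @ rng.multivariate_normal(np.zeros(m), Sigma)
                 for _ in range(20000)])
print(np.cov(ests.T))                         # approximates ...
print(np.linalg.inv(X.T @ Si @ X))            # ... (X'Sigma^{-1}X)^{-1}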

Can anybody help me understand where the reasoning is going awry? I can't
understand -- if the reasoning presented here is correct -- why the maximum
likelihood method apparently doesn't work in the case of this very simple,
straightforward model.



I have one hunch as to why this model won't work with ML-estimation (it is
certainly not an uncommon model, though, at least in engineering
applications, but usually OLS is used and statistical aspects are ignored
completely). The more common standard linear model is

y_{i} = X b + e_{i}            (i = 1,...,n)                             (9)

i.e. the b_{i} of the model in equation (1) are all equal. This is a special
case of the more general model that goes by the name of "growth model" (in
the context of the MANOVA model):

Y = X B A + E                                                           (10)

Here X is a matrix whose entries specify some sort of anticipated trend,
e.g. the coefficients of some standardized polynomials that account for
linear, quadratic, cubic, etc. trends in the vector of responses y_{i},
whose entries are then typically repeated measurements of the same variable
with some time-lapse between them. The q \times n matrix A is a design
matrix that dummy-codes the different conditions under which the
measurements were made, e.g. in

A = [ 1 .... 1 0 .... 0 ]
    [ 0 .... 0 1 .... 1 ]

the first row encodes which measurements in Y were made on experimental
units in condition 1 (and which were not), while the second row does the
same for the experimental units in condition 2. Now B is the p \times q
matrix of coefficients which have to be estimated, and which specify the
presence in the data, under the different conditions, of the different
trends specified by the columns of X.
The matrix E is simply the m \times n matrix of the errors [e_{1}...e_{n}].
It is easily seen that the model in equation (9) is indeed a special case of
the growth-model in equation (10) by taking A = [1 .... 1] = 1(n), that is,
A is the row vector of n ones. It can be shown by differentiation of the
concentrated likelihood that the maximum likelihood estimate of B is given
by:

est(B) = (X'D^{-1}X)^{-1}X'D^{-1}Y A'(AA')^{-1}                         (11)

where

D = Y[I - A'(AA')^{-1}A]Y' = Y[I - P(A')]Y'                             (12)

From equation (12) it can be seen immediately that, since rank[I - P(A')] =
n - q, D is guaranteed to be singular as soon as n - q < m, i.e. q > n - m.
Consequently, the growth-model is not estimable in this manner if q > n - m
(but maybe in some other way?). So unfortunately, A cannot be the n \times
n identity matrix. But in fact this surprises me, since I would expect it
should be possible to estimate the p parameters in the model y = X b + e if
the dimension m of the response vector is much larger than p, and this
should be possible every time such a y is observed.
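
For concreteness, here is a sketch of the estimator (11)-(12) in
Python/numpy, on simulated data with arbitrary sizes chosen such that
n - q >= m, so that D stays nonsingular (A is the dummy coding of the
example above):

import numpy as np

rng = np.random.default_rng(1)
m, p, q, n = 5, 2, 2, 20                  # arbitrary sizes with n - q >= m
X = rng.standard_normal((m, p))
A = np.kron(np.eye(q), np.ones(n // q))   # dummy coding as in the example
B = rng.standard_normal((p, q))
Y = X @ B @ A + rng.standard_normal((m, n))   # Sigma = I for simplicity

PA = A.T @ np.linalg.solve(A @ A.T, A)    # P(A'), projector onto rowspace of A
D = Y @ (np.eye(n) - PA) @ Y.T            # equation (12); rank <= min(m, n-q)
Di = np.linalg.inv(D)                     # ok here, because n - q = 18 >= m = 5
B_hat = (np.linalg.solve(X.T @ Di @ X, X.T @ Di @ Y)
         @ A.T @ np.linalg.inv(A @ A.T))  # equation (11)
print(np.abs(B_hat - B).max())            # modest estimation error

With A the n \times n identity (i.e. q = n), P(A') = I and D = 0, so the
same code fails at the inversion of D -- which is the problem above.
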
I guess the trouble lies in the covariance matrix that has to be
estimated -- this is another hunch I have: the ML-estimate of Sigma was
written

est(Sigma) = n^{-1} [Y - XB][Y - XB]'

                     n
           = n^{-1} SUM [y_{i} - Xb_{i}][y_{i} - Xb_{i}]'
                    i=1

For each term in the summation p "mean-parameters" are estimated, which
(heuristically) leaves r = n - np = n(1-p) =< 0 degrees of freedom in the
Wishart distribution of est(Sigma)! In the growth model this would be r =
n - p(rank A) > 0 <=> rank A < n/p. Why doesn't this show up in the
ML-derivation? (I would like to hear WHAT I am doing wrong...)
Apart from this degrees-of-freedom problem, I now think it's not justified
to consider the b_{i} to be mathematical variables, and that they should be
considered random; but even then, I can't see where it breaks down.

To summarize, these are the questions I would like to have answered:

1) I've never really understood this: in the present model n >
m*(m+1)/[2(m-p)], but apparently it's not the case that a model can be
estimated as soon as the number of observations exceeds the number of
parameters; is that just a necessary, but not a sufficient, condition? Does
this mean that the currently proposed estimators are inconsistent?

2) Is the implicit estimate in equation (4) a standard result? How does
one deal with equations of this sort?

3) Is minimizing the determinant in equation (5) -- thereby maximizing the
likelihood -- equivalent to minimizing the trace of the same matrix; and
are the ML-estimates of the parameters in a linear model under normally
distributed errors thus the same as the OLS estimates?

4) Is the reasoning around equation (7) correct -- i.e. does the
concentrated likelihood indeed explode if the ML-estimate of B given in eq.
(4) is substituted for B, just like it did for the OLS estimate of B?

5) Do there exist different methods of estimating the growth-model when the
matrix D in eq. (12) becomes singular?

6) What is the relation between degrees of freedom and ML-estimation? Or
rather: how do I know the parameters in a model can be estimated using the
ML-method? What essential assumptions does the ML-method make? Or is the
question of estimability of parameters completely independent of the method
of estimation (and if so WHY)?

7) Should the b_{i}'s be considered random, or may they be interpreted as
mathematical variables? It occurs to me that, putting the question this
way, I should make clear what I mean by 'mathematical variable'.
Mathematical variables are often called 'fixed' in the comparison of
'fixed' vs. 'random' variables in a statistical model. The term 'fixed'
implies that there should be something constant about them. Therefore I
suppose it is not justified to think of these b_{i}'s as mathematical
variables, because by definition of the model they can assume any value
with each observation y_{i}, and so, if there is anything 'constant' about
them, then this would be in terms of the ranges of values they assume, and
which of those values are more likely -- i.e. in terms of parameters of a
distribution function. But this is really a mathematical question I'm not
able to answer myself.

8) If Sigma were known, I expect that b in the model y = X b + e should be
estimable as long as p < m, by taking the GLS estimate: est(b) =
(X'Sigma^{-1}X)^{-1}X'Sigma^{-1}y. And I expect this should then be
possible every time an observation y is made; collecting these: Y = XB + E
can be estimated by B = (X'Sigma^{-1}X)^{-1}X'Sigma^{-1}Y (see the sketch
below). Why doesn't this work if Sigma has to be estimated?
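
To illustrate question 8 in Python/numpy (simulated data, arbitrary sizes):
with Sigma known, only the p \times p matrix X'Sigma^{-1}X needs inverting,
so nothing degenerates -- though note that the error in each est(b_{i})
does not shrink as n grows, since every b_{i} is estimated from a single
y_{i}:

import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 2, 20                            # arbitrary sizes
X = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))
Sigma = np.eye(m) + 0.5 * np.ones((m, m))     # some nonsingular covariance
Y = X @ B + rng.multivariate_normal(np.zeros(m), Sigma, size=n).T

Si = np.linalg.inv(Sigma)                     # Sigma KNOWN: nothing estimated
B_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)   # one column per b_i
print(np.abs(B_gls - B).mean())               # O(1) per entry, fixed in n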


Thanks for reading this far!

cheers,

Raoul.

--
############################
Raoul P.P.P. Grasman
Dept. of Psychology
University of Amsterdam
############################


