Charles R Harris wrote:
> On Jan 8, 2008 7:48 PM, Robert Kern <[EMAIL PROTECTED]> wrote:
>
>> Charles R Harris wrote:
>>> Suppose you have a set of z_i and want to choose z to minimize the
>>> average square error $\sum_i |z_i - z|^2$. The solution is
>>> $z = \bar{z}$, the mean of the z_i, and the resulting average error
>>> is given by 2). Note that I didn't mention Gaussians anywhere. No
>>> distribution is needed to justify the argument, just the idea of
>>> minimizing the squared distance. Leaving out the ^2 would yield
>>> another metric, or one could ask for a minimax solution. It is a
>>> question of the distance function, not probability. Anyway, that is
>>> one justification for the approach in 2), and it is one that makes a
>>> lot of applied math simple. Whether or not a least squares fit is
>>> useful is a different question.
>>
>> If you're not doing probability, then what are you using var() for? I
>> can accept that the quantity is meaningful for your problem, but I'm
>> not convinced it's a variance.
>
> Lots of fits don't involve probability distributions. For instance, one
> might want to fit a polynomial to a mathematical curve. This sort of
> distinction between probability and distance goes back to Gauss himself,
> although not in his original work on least squares. Whether or not
> variance implies probability is a semantic question.
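As an aside, the least-squares argument quoted above is easy to check numerically. A minimal sketch (the data and variable names here are mine, purely illustrative): the mean minimizes the average squared distance, and any perturbation of the center increases it.

```python
import numpy as np

# Illustrative complex sample; any complex data works.
rng = np.random.default_rng(0)
z = rng.normal(size=200) + 1j * rng.normal(size=200)

# The mean minimizes the average squared error (1/n) sum_i |z_i - c|^2
# over all choices of center c; the minimum value is the quantity in 2).
zbar = z.mean()
avg_sq_err = np.mean(np.abs(z - zbar) ** 2)

# Moving the center in any direction can only increase the error,
# since mean|z_i - c|^2 = mean|z_i - zbar|^2 + |zbar - c|^2.
for dz in (0.1, -0.1, 0.1j, -0.1j):
    assert np.mean(np.abs(z - (zbar + dz)) ** 2) > avg_sq_err
```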
Well, the problem in front of us is entirely semantics: what does the
string "var(z)" mean? Are we going to choose a mechanistic definition,
"var(z) is implemented in such and such a way, and interpretations are
left open"? In that case, why are we using the string "var(z)" rather
than something else? We're also still left with the question of which
such-and-such implementation to use.

Alternatively, we can look at what people call "variances" and try to
implement the calculation of such. In that case, the term "variance"
tends to crop up (and in my experience *only* crop up) in statistics and
probability. Certain implementations of the calculations of such
quantities have cognates elsewhere, but those cognates are not
themselves called variances. My question to you is: is "the resulting
average error" a variance? I.e., do people call it a variance outside of
S&P? There are any number of computations that are useful but are not
variances, and I don't think we should make "var(z)" implement them.

In S&P, the single quantity "variance" is well defined for real RVs,
even if you step away from Gaussians: it's the second central moment of
the PDF of the RV. When you move up to CC (or RR^2), the definition of
"moment" changes. It's no longer a real number or even a scalar; the
second central moment is a covariance matrix. If we're going to call
something "the variance", that's it. The circularly symmetric forms are
special cases. Although option #2 is a useful quantity to calculate in
some circumstances, I think it's bogus to give it a special status.

> I think if we are going to compute a single number, 2) is as good as
> anything even if it doesn't capture the shape of the scatter plot. A
> 2D covariance wouldn't necessarily capture the shape either.

True, but it is clear exactly what it is. The function is named
"cov()", and it computes covariances. It's not called
"shape_of_2D_pdf()". Whether or not one ought to compute a covariance is
not "cov()"'s problem.
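To make the relationship between the two candidates concrete, here is a hedged sketch (the sample and variable names are mine, not from the thread): the 2x2 covariance matrix of the real and imaginary parts is the second central moment of the complex data viewed as points in R^2, and the scalar quantity in option #2 is exactly its trace, so it discards the shape information the matrix carries.

```python
import numpy as np

# Illustrative, deliberately anisotropic complex sample
# (not circularly symmetric, so the two views differ visibly).
rng = np.random.default_rng(1)
z = rng.normal(size=500) + 2j * rng.normal(size=500)

# Second central moment as a covariance matrix: cov of (Re z, Im z).
C = np.cov(z.real, z.imag, ddof=0)

# Option #2: the average squared distance from the mean, a single
# real number. It equals trace(C) = var(Re z) + var(Im z).
scalar = np.mean(np.abs(z - z.mean()) ** 2)
assert np.isclose(np.trace(C), scalar)
```

Note that `C[0, 0]` and `C[1, 1]` differ for this sample (roughly 1 vs. 4), which is precisely the shape information the trace collapses.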
--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion