On Fri, Mar 7, 2014 at 12:06 AM, <[email protected]> wrote:
> On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <[email protected]> wrote:
>> On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg
>> <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe
>>> suggested adding new parameters to our `cov` and `corrcoef` functions
>>> to implement weights, which already exists for `average` (the PR still
>>> needs to be adapted).
>>>
>>> The idea right now would be to add `weights` and `frequencies`
>>> keyword arguments to these functions.
>>>
>>> In more detail: the situation is a bit more complex for `cov` and
>>> `corrcoef` than for `average`, because there are different types of
>>> weights. The current plan would be to add two new keyword arguments:
>>> * weights: uncertainty weights, which cause `N` to be recalculated
>>>   accordingly (this is R's `cov.wt` default, I believe).
>>> * frequencies: when given, `N = sum(frequencies)` and the values
>>>   are weighted by their frequency.
>>
>> I don't understand this description at all. One of them recalculates N,
>> and the other sets N according to some calculation?
>>
>> Is there a standard reference on how these are supposed to be
>> interpreted? When you talk about per-value uncertainties, I start
>> imagining that we're trying to estimate a population covariance given
>> a set of samples, each corrupted by independent measurement noise, and
>> then there's some natural hierarchical Bayesian model one could write
>> down and get an ML estimate of the latent covariance via empirical
>> Bayes or something. But this requires a bunch of assumptions, and is
>> that really what we want to do? (Or maybe it collapses down into
>> something simpler if the measurement noise is Gaussian or something?)
>
> In general, going mostly based on Stata:
>
> Frequency weights are just a shortcut if you have repeated
> observations. In my unit tests, the result is the same as using
> np.repeat, IIRC. The total number of observations is the sum of the
> weights.
>
> aweights and pweights are mainly like weights in WLS, reflecting the
> uncertainty of each observation. The number of observations is equal
> to the number of rows. (Stata internally rescales the weights.)
> One explanation is that observations are measured with different
> noise; another is that observations represent the means of subsamples
> with different numbers of observations.
>
> There is an additional degrees-of-freedom correction in one of the
> proposed calculations, modeled after other packages, that I never
> figured out.
I found the missing proof:
http://stats.stackexchange.com/questions/47325/bias-correction-in-weighted-variance

Josef

>
> (Aside: statsmodels does not normalize the scale in WLS, in contrast
> to Stata, and it is now equivalent to GLS with a diagonal sigma. The
> meaning of weight=1 depends on the user. nobs is the number of rows.)
>
> No Bayesian analysis involved, but I guess someone could come up with
> a Bayesian interpretation.
>
> I think the two proposed weight types, weights and frequencies, should
> be able to handle almost all cases.
>
> Josef
>
>>
>> -n
>>
>> --
>> Nathaniel J. Smith
>> Postdoctoral researcher - Informatics - University of Edinburgh
>> http://vorpus.org
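
A minimal sketch of the frequency-weight case discussed above, in plain
NumPy (this is not the PR's code; the function and variable names are only
illustrative, and observations are assumed to be columns, as in np.cov).
It checks the claim that frequency weights give the same result as
np.repeat:

    import numpy as np

    def cov_fweights(m, frequencies):
        # m: variables in rows, observations in columns (np.cov convention)
        # frequencies: nonnegative integer repeat counts, one per observation
        f = np.asarray(frequencies, dtype=float)
        N = f.sum()                      # total number of observations
        mean = np.dot(m, f) / N          # frequency-weighted mean of each row
        d = m - mean[:, None]
        return np.dot(d * f, d.T) / (N - 1.0)   # usual "N - 1" denominator

    x = np.array([[1.0, 2.0, 4.0],
                  [2.0, 1.0, 8.0]])
    freq = np.array([1, 2, 3])

    # same result as literally repeating the observations
    assert np.allclose(cov_fweights(x, freq),
                       np.cov(np.repeat(x, freq, axis=1)))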

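And the bias correction from that stats.stackexchange link, written out for
the scalar variance with uncertainty-type weights (again just a sketch with
made-up names; with equal weights it reduces to the ordinary ddof=1
estimate):

    import numpy as np

    def weighted_var_unbiased(x, w):
        # uncertainty ("reliability") weights, not repeat counts
        # correction: divide by V1 - V2/V1, with V1 = sum(w) and V2 = sum(w**2)
        w = np.asarray(w, dtype=float)
        V1 = w.sum()
        V2 = (w ** 2).sum()
        m = np.dot(w, x) / V1
        return np.dot(w, (x - m) ** 2) / (V1 - V2 / V1)

    x = np.array([1.0, 2.0, 4.0, 8.0])
    # with equal weights this is just the ordinary ddof=1 variance
    assert np.isclose(weighted_var_unbiased(x, np.ones_like(x)),
                      x.var(ddof=1))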