Re: [HACKERS] Weighted Stats

On Sat, Mar 19, 2016 at 05:04:08PM +0100, Tomas Vondra wrote:
> Hi,
>
> On 03/19/2016 07:34 AM, David Fetter wrote:
> >On Fri, Mar 18, 2016 at 06:12:12PM -0700, Jeff Janes wrote:
> >>On Tue, Mar 15, 2016 at 8:36 AM, David Fetter <da...@fetter.org> wrote:
> >>>
> >>>Please find attached a patch that uses the float8 version to cover the
> >>>numeric types.
> >>
> >>Is there a well-defined meaning for having a negative weight?  If no,
> >>should it be disallowed?
> >
> >Opinions on this appear to vary.  A Wikipedia article defines weights
> >as non-negative, while a manual to which it refers only uses non-zero.
> >
> >https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
> >https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html
>
> I don't think that actually allows negative weights. It says that
>
>    w_i = 1/\sigma_i^2
>
> and variance is always > 0, thus w_i > 0. The zero is used as a special flag
> to remove the sample from the data set in a simple way.
>
> >I'm not sure which if either would be authoritative, but I could
> >certainly make up variants for each assumption.
> >
> >The assumption they have in common about weights is that a zero
> >weight is not part of the calculation, which assumption is
> >implemented in the previously submitted code.
>
> I think that if we're not sure what should happen with negative weights,
> then we should disallow them. It's easy to allow them later once we have a
> reasonable definition, but if we allow them now and later realize it should
> behave differently, we'll be in trouble because of breaking existing uses.

OK

> I can't really come up with a reasonable example that would actually use
> negative weights. Can you? That would probably help with defining the
> behavior correctly.

No, but I'm not a statistician.  I've seen them mentioned in contexts
that appear to be discussions among same, and again opinions vary.

> Allowing negative weights has other consequences. For example, what if
> sum(W) ends up being 0? For example
>
> CREATE TABLE t (a float, b float);
> INSERT INTO  t SELECT i,  1 FROM generate_series(1,1000) s(i);
> INSERT INTO  t SELECT i, -1 FROM generate_series(1,1000) s(i);
>
> SELECT weighted_avg(a,b) FROM t;
>  weighted_avg
> --------------
>           NaN
> (1 row)
>
> Is that the correct behavior? Why?

It's not, and you're right.

I will send a patch that disallows negative weights this evening or
tomorrow.  It will be slightly more complicated as I believe I will
need to create a new accumulator function for the weighted_avg() case
where I had been using an extant one before.

Cheers,
David.
--
David Fetter <da...@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fet...@gmail.com

Remember to vote!