FYI, I'm using the verb "rewind" to talk about using the negative transition aggregation function to get a prior value. I don't know if this is the right verb.
Conceptually, when aggregating over floating point numbers, there is some infinitely precise theoretical value, and the computation approximates it. Call the infinitely precise value 'r' and the computed value 'c', which is the result of the aggregation function. (For float4_agg and float8_agg, typeof(c) == float8.)

The problem with rewinding an aggregate is that you don't know whether you are getting the same value of c that you would have gotten from a rescan. But if you have a state type that tracks a margin [min, max], where typeof(min) == typeof(max) is higher precision than typeof(c), then you can maintain the invariant

    min <= r <= max

by setting the rounding mode down and up when computing the next value of min and max, respectively. (Extra flag bits or booleans could track whether you've encountered +inf, -inf, NaN, or any other oddball cases, with the corresponding special logic that has been discussed already upthread.)

In many, but not all, cases:

    min != max, but (typeof(c))min == (typeof(c))max

because the margin of error is small enough not to render different values when cast down to the lower precision typeof(c). You could rewind the aggregation whenever this second condition holds, and force a rescan only when it does not. This would render the same results for queries whether they were performed with rewinds or with rescans. The results might differ from older versions of Postgres, but only in that they might be more accurate, with less accumulated rounding error, owing to the higher-precision state transition variable.

On many modern platforms, typeof(min) could be __float128 using libquadmath, or something similar. If that's not available at compile time, it could fall back to float8 instead. Even then, you'd still know that rewinding was possible when min == max and not otherwise, which is useful for cases of aggregation over exactly representable values.
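To make the margin idea concrete, here is a rough sketch (mine, not from any actual patch). Python's decimal module stands in for the higher-precision state type, and its ROUND_FLOOR/ROUND_CEILING contexts stand in for the hardware rounding modes the post envisions; the MarginSum name and 25-digit precision are hypothetical choices for illustration only.

```python
# Hypothetical sketch of the [min, max] margin idea.
# decimal stands in for a higher-precision state type (e.g. __float128);
# ROUND_FLOOR / ROUND_CEILING stand in for hardware rounding modes.
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

CTX_DOWN = Context(prec=25, rounding=ROUND_FLOOR)    # rounds toward -inf
CTX_UP   = Context(prec=25, rounding=ROUND_CEILING)  # rounds toward +inf

class MarginSum:
    """Tracks bounds lo <= r <= hi around the exact sum r."""
    def __init__(self):
        self.lo = Decimal(0)
        self.hi = Decimal(0)

    def add(self, x):                        # forward transition
        d = Decimal(x)                       # float -> Decimal is exact
        self.lo = CTX_DOWN.add(self.lo, d)
        self.hi = CTX_UP.add(self.hi, d)

    def remove(self, x):                     # inverse transition ("rewind")
        d = Decimal(x)
        self.lo = CTX_DOWN.subtract(self.lo, d)
        self.hi = CTX_UP.subtract(self.hi, d)

    def rewind_safe(self):
        # Both bounds collapse to the same value once cast to
        # typeof(c) == float8, so the rewound result matches a rescan.
        return float(self.lo) == float(self.hi)

    def result(self):
        return float(self.hi)

s = MarginSum()
for x in (1e20, 1.0, 2.0):
    s.add(x)
s.remove(1e20)            # rewind the frame past the huge value
print(s.rewind_safe())    # True: 25 digits hold 1e20 + 3 exactly
print(s.result())         # 3.0, not the 0.0 a plain float8 state would give
```

Note how the higher-precision state also delivers the "more accurate than older versions" behavior: after rewinding past 1e20, the sketch recovers 3.0 exactly, where a float8 running sum would have already lost the 1 and 2.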
I admit I've done a bit of handwaving on the computation of the margin and the handling of floating-point rounding issues, but I believe the implementation details are tractable.

mark

On Friday, January 10, 2014 10:10 AM, Florian Pflug <f...@phlo.org> wrote:

On Jan10, 2014, at 18:14 , Kevin Grittner <kgri...@ymail.com> wrote:
> Given that this is already the case with aggregates on floating
> point approximate numbers, why should we rule out an optimization
> which only makes rounding errors more likely to be visible? The
> real issue here is that if you are using an approximate data type
> and expecting exact answers, you will have problems.

Because without the optimization, only the values which you *actually* process for a given result determine whether you lose precision or not. With the optimization, OTOH, values which have *nothing* to do with the result in question can nevertheless make it completely bogus.

SUM() is a good example. As long as all your values are positive, the amount of precision you lose is bounded by the number of input values. If I sum over 10 values, the worst that can happen is that the first value is large enough to prevent the other 9 values from influencing the result. That limits the relative error to something like 9*epsilon, where epsilon is the relative precision of the floating-point type, i.e. about 1e-15 for double. In other words, as long as your frames are less than 10^13 rows long, the relative error will stay below 1%.

But with the optimization, that is no longer true. If you sum from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error of the result in one row now depends on the magnitude of values *preceding* that row, even though those values aren't in the frame. And since we now internally subtract, not only add, the relative error is no longer bounded by the number of rows in the frame.
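The failure mode described here needs no window-function machinery to reproduce. A minimal sketch in plain Python floats (IEEE-754 doubles, like float8), simulating how an inverse transition function would maintain a running sum versus recomputing each frame from scratch (variable names are mine):

```python
# Simulate sum() over "current row to unbounded following" two ways,
# using plain Python floats (IEEE-754 double, i.e. float8).
xs = [1e20, 1.0, 2.0]

# Rescan: recompute each frame's sum from its actual members.
rescan = [sum(xs[i:]) for i in range(len(xs))]

# Inverse transition: compute the full sum once, then "subtract off"
# each departing value as the frame start advances.
state = 0.0
for x in xs:
    state += x    # 1e20 + 1.0 + 2.0 == 1e20: the 1 and 2 are absorbed
inverse = []
for x in xs:
    inverse.append(state)
    state -= x    # the departing value is subtracted back out

print(rescan)     # [1e+20, 3.0, 2.0]
print(inverse)    # [1e+20, 0.0, -1.0]
```

The rescan results for rows 2 and 3 are exact, while the inverse-transition results are completely bogus: the value 1e20, which is not even in those frames, has destroyed them.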
Here's the corresponding SELECT (which is basically the same as Tom's example upthread):

select n, x::float, sum(x::float) over (
    order by n rows between current row and unbounded following
)
from (values (1, 1e20), (2, 1), (3, 2)) as t(n, x)
order by n;

Currently that returns

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     3
 3 |     2 |     2

but with an inverse transition function, it may very well return

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     0
 3 |     2 |    -1

> That's not to say that approximations are useless. If you
> represent the circumference of the earth with a double precision
> number you're dealing with an expected rounding error of about a
> foot. That's close enough for many purposes. The mistake is
> assuming that it will be exact or that rounding errors cannot
> accumulate. In situations where SQL does not promise particular
> ordering of operations, it should not be assumed; so any
> expectations of a specific or repeatable result from a sum or
> average of approximate numbers is misplaced.

But this isn't about ordering; it's about replacing one computation with a completely different one that just happens to be equivalent *algebraically*. To me, the proposed optimization for float is akin to a C compiler which decided to evaluate

    a + b + c + ... + z

as

    -a + (2a - b) + (2b - c) + ... + (2y - z) + 2z

Algebraically, these are the same, but it'd still be insane to do that.

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers