FYI, I'm using the verb "rewind" to talk about using the negative transition aggregation function to get a prior value. I don't know if this is the right verb.
Conceptually, when aggregating over floating point numbers, there is some infinitely precise theoretical value, and the computation approximates it. Call the infinitely precise value 'r' and the computed value 'c', which is the result of the aggregation function. (For float4_agg and float8_agg, typeof(c) == float8.)

The problem with rewinding an aggregate is that you don't know whether you are getting the same value of c that you would have gotten from a rescan. But if you have a state type that tracks a margin [min, max], where typeof(min) == typeof(max) is higher precision than typeof(c), then you can maintain the invariant

    min <= r <= max

by setting the rounding mode down and up when computing the next value of min and max, respectively. (Extra flag bits or booleans could track whether you've encountered +inf, -inf, NaN, or any other oddball cases, with the corresponding special logic that has been discussed already upthread.)

In many, but not all, cases:

    min != max, but (typeof(c))min == (typeof(c))max

because the margin of error is small enough not to render different values when cast down to the lower precision typeof(c). You could rewind the aggregation whenever this second condition holds, and force a rescan only when it does not. This would render the same results for queries whether they were performed with rewinds or with rescans. The results might differ from older versions of Postgres, but only in that they might be more accurate, with less accumulated rounding error, owing to the higher-precision state transition variable.

On many modern platforms, typeof(min) could be __float128 using libquadmath, or something similar. If that's not available at compile time, it could fall back to float8 instead. Even then, you'd still know that rewinding was possible when min == max and not otherwise, which is useful for cases of aggregation over exactly representable values.
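To make the margin idea concrete, here is a rough sketch (mine, not from any actual patch). Python's decimal module stands in for the higher-precision state type, and its ROUND_FLOOR/ROUND_CEILING contexts stand in for the hardware rounding modes the post envisions; the MarginSum name and 25-digit precision are hypothetical choices for illustration only.

```python
# Hypothetical sketch of the [min, max] margin idea.
# decimal stands in for a higher-precision state type (e.g. __float128);
# ROUND_FLOOR / ROUND_CEILING stand in for hardware rounding modes.
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

CTX_DOWN = Context(prec=25, rounding=ROUND_FLOOR)    # rounds toward -inf
CTX_UP   = Context(prec=25, rounding=ROUND_CEILING)  # rounds toward +inf

class MarginSum:
    """Tracks bounds lo <= r <= hi around the exact sum r."""
    def __init__(self):
        self.lo = Decimal(0)
        self.hi = Decimal(0)

    def add(self, x):                        # forward transition
        d = Decimal(x)                       # float -> Decimal is exact
        self.lo = CTX_DOWN.add(self.lo, d)
        self.hi = CTX_UP.add(self.hi, d)

    def remove(self, x):                     # inverse transition ("rewind")
        d = Decimal(x)
        self.lo = CTX_DOWN.subtract(self.lo, d)
        self.hi = CTX_UP.subtract(self.hi, d)

    def rewind_safe(self):
        # Both bounds collapse to the same value once cast to
        # typeof(c) == float8, so the rewound result matches a rescan.
        return float(self.lo) == float(self.hi)

    def result(self):
        return float(self.hi)

s = MarginSum()
for x in (1e20, 1.0, 2.0):
    s.add(x)
s.remove(1e20)            # rewind the frame past the huge value
print(s.rewind_safe())    # True: 25 digits hold 1e20 + 3 exactly
print(s.result())         # 3.0, not the 0.0 a plain float8 state would give
```

Note how the higher-precision state also delivers the "more accurate than older versions" behavior: after rewinding past 1e20, the sketch recovers 3.0 exactly, where a float8 running sum would have already lost the 1 and 2.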
I admit I've done a bit of handwaving on the computation of the margin and the handling of floating-point rounding issues, but I believe the implementation details are tractable.

mark

On Friday, January 10, 2014 10:10 AM, Florian Pflug <f...@phlo.org> wrote:

On Jan10, 2014, at 18:14 , Kevin Grittner <kgri...@ymail.com> wrote:
> Given that this is already the case with aggregates on floating
> point approximate numbers, why should we rule out an optimization
> which only makes rounding errors more likely to be visible? The
> real issue here is that if you are using an approximate data type
> and expecting exact answers, you will have problems.

Because without the optimization, only the values which you *actually* process for a given result determine whether you lose precision or not. With the optimization, OTOH, values which have *nothing* to do with the result in question can nevertheless make it completely bogus.

SUM() is a good example. As long as all your values are positive, the amount of precision you lose is bounded by the number of input values. If I sum over 10 values, the worst that can happen is that the first value is large enough to prevent the other 9 values from influencing the result. That limits the relative error to something like 9*epsilon, where epsilon is the relative precision of the floating-point type, i.e. about 1e-15 for double. In other words, as long as your frames are less than 10^13 rows long, the relative error will stay below 1%.

But with the optimization, that is no longer true. If you sum from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error of the result in one row now depends on the magnitude of values *preceding* that row, even though those values aren't in the frame. And since we now internally subtract, not only add, the relative error is no longer bounded by the number of rows in the frame.
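The failure mode described here needs no window-function machinery to reproduce. A minimal sketch in plain Python floats (IEEE-754 doubles, like float8), simulating how an inverse transition function would maintain a running sum versus recomputing each frame from scratch (variable names are mine):

```python
# Simulate sum() over "current row to unbounded following" two ways,
# using plain Python floats (IEEE-754 double, i.e. float8).
xs = [1e20, 1.0, 2.0]

# Rescan: recompute each frame's sum from its actual members.
rescan = [sum(xs[i:]) for i in range(len(xs))]

# Inverse transition: compute the full sum once, then "subtract off"
# each departing value as the frame start advances.
state = 0.0
for x in xs:
    state += x    # 1e20 + 1.0 + 2.0 == 1e20: the 1 and 2 are absorbed
inverse = []
for x in xs:
    inverse.append(state)
    state -= x    # the departing value is subtracted back out

print(rescan)     # [1e+20, 3.0, 2.0]
print(inverse)    # [1e+20, 0.0, -1.0]
```

The rescan results for rows 2 and 3 are exact, while the inverse-transition results are completely bogus: the value 1e20, which is not even in those frames, has destroyed them.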
Here's the corresponding SELECT (which is basically the same as Tom's example upthread):

select n, x::float, sum(x::float) over (
    order by n rows between current row and unbounded following
)
from (values (1, 1e20), (2, 1), (3, 2)) as t(n, x)
order by n;

Currently that returns

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     3
 3 |     2 |     2

but with an inverse transition function, it may very well return

 n |   x   |  sum
---+-------+-------
 1 | 1e+20 | 1e+20
 2 |     1 |     0
 3 |     2 |    -1

> That's not to say that approximations are useless. If you
> represent the circumference of the earth with a double precision
> number you're dealing with an expected rounding error of about a
> foot. That's close enough for many purposes. The mistake is
> assuming that it will be exact or that rounding errors cannot
> accumulate. In situations where SQL does not promise particular
> ordering of operations, it should not be assumed; so any
> expectations of a specific or repeatable result from a sum or
> average of approximate numbers is misplaced.

But this isn't about ordering; it's about replacing one computation with a completely different one that just happens to be equivalent *algebraically*. To me, the proposed optimization for float is akin to a C compiler which decided to evaluate

    a + b + c + ... + z

as

    -a + (2a - b) + (2b - c) + ... + (2y - z) + 2z

Algebraically, these are the same, but it'd still be insane to do that.

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers