At Mon, 14 Nov 2016 11:41:52 +0100, Emre Hasegeli <e...@hasegeli.com> wrote in 
> > What way to deal with it do you have in mind? The problem hides
> > behind operators. To fix it, a user should rewrite an expression
> > using more primitive operators. For example, (line_a # point_a)
> > should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in
> > an even more primitive way. I regard this as making the operator #
> > just useless.
> Simple equations like this work well with the current algorithm:
> > select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

That's too simple for this discussion. If that were all a user
required, a geometric type over integers would do. Any
trigonometric calculation (by anything other than right angles)
would break it.

> The operator does what you expect from it.  Users can use something
> like you described to get fuzzy behaviour with an epsilon they choose.
> > Regarding optimization, at least gcc generates seemingly not so
> > different code for the two. Both generally generate expanded
> > code directly calling isnan() and so on. Have you measured the
> > performance of the two implementations (with -O2, without
> > --enable-cassert)?  This kind of hand-optimization gains
> > legitimacy when we see a significant difference, according to
> > the convention here.. I suppose.
> I tested it with this program:
> > int main()
> > {
> >    double  i,
> >            j;
> >    int     result = 0;
> >
> >    for (i = 0.1; i < 10000.0; i += 1.0)
> >        for (j = 0.1; j < 10000.0; j += 1.0)
> >            if (float8_lt(i, j))
> >                result = (result + 1) % 10;
> >
> >    return result;
> > }
> The one calling cmp() was noticeable slower.
> ./test1  0.74s user 0.00s system 99% cpu 0.748 total
> ./test2  0.89s user 0.00s system 99% cpu 0.897 total
> This would probably not be very noticeable when calling SQL functions
> which do only a few comparisons, but it may be necessary to do
> many more comparisons in some places.  I don't find the optimised
> versions less clear than calling the cmp().  I can change it the other
> way, if you find that clearer.

Thanks for the measurement. I had similar numbers (14:13) for
code separated out of Pg, but this difference is probably buried
under the other steps. The keep-it-simple-stupid rule says that
this kind of optimization should be considered only after it is
found to be problematic, I think.

> > At least the comment you dropped by the patch,
> >
> >  int
> >  float4_cmp_internal(float4 a, float4 b)
> >  {
> > -       /*
> > -       /*
> > -        * We consider all NANs to be equal and larger than any non-NAN.
> > -        * This is somewhat arbitrary; the important thing is to have a
> > -        * consistent sort order.
> > -        */
> >
> > seems very significant and should be kept anywhere relevant.
> I will add it back on the next version.
> > I searched the pgsql-general ML but couldn't find a complaint about
> > the current behavior. Even though I'm not familiar with PostGIS, I
> > found that it uses exactly the same EPSILON method as
> > PostgreSQL.
> Is it?  I understood from Paul Ramsey's comment on this thread [1]
> that they don't.

I looked through the PostGIS repository myself, but found this
only in box2d.c. Other objects have their own EPSILONs:
EPSILON_SQLMM, DBL_EPSILON, FLT_EPSILON... Maybe that's all.

|  * Copyright (C) 2008-2011 Paul Ramsey <pram...@cleverelephant.ca>
|  * Copyright (C) 2008 Mark Cave-Ayland <mark.cave-ayl...@siriusit.co.uk>
| #ifndef EPSILON
| #define EPSILON        1.0E-06
| #endif
| #ifndef FPeq
| #define FPeq(A,B)     (fabs((A) - (B)) <= EPSILON)
| #endif

I agree that we should have our own tolerance method, and it might
be an explicit one. But in any case we need some kind of tolerance
when comparing FP numbers. Removing the existing tolerance as
preparation for implementing another one is quite natural, but a
released version should still have some kind of tolerance.

> > If we had an apparent plan to use them for other than
> > earth-scale(?)  geometric usage, we could design what they should
> > be. But without such a plan it is just a breakage of the current
> > usage.
> We give no promises about the geometric types being useful in earth scale.

So, we shouldn't just remove it.

> > What kind of (needless) complication are you saying we have? The
> > fuzziness seems to me essential for geometric comparisons to work
> > practically. In addition to that, I don't think we're allowed to
> > change the behavior in such an area time after time across
> > released versions.
> Even when it is a total mess?

Yes. The seeming 'total mess' was intended for use with lon/lat
values and probably gave useful results at the time of
development. A mess is still better than something simply broken.

> > I don't think index scans and tolerant comparison are
> > contradictory. Could you give me a description of the indexing
> > capabilities and the inconsistencies?
> The first problem is that some operators are not using the epsilon.
> This requires special treatment while developing index support for
> operators.  I have tried to support point for BRIN using the box
> operators, and failed because of that.
> Comparing differences with epsilon is not applicable the same way to
> every operator.  Even with simple operators like "point in box" it
> covers different distances outside the box depending on where the
> point is.  For example, "point <-> box < EPSILON" wouldn't be
> equivalent to "point <@ box" when the point is outside the corner of
> the box.  Things get more complicated with lines.  Because of these,
> we are easily violating basic expectations of the operators:

To keep that kind of integrity, we would have to consider the
tolerance design very carefully.

> > regression=# select '{1000,0.000001,0}'::line ?|| '{90000,0.00009,0}'::line;
> >
> > ?column?
> > ----------
> > f
> > (1 row)
> >
> > regression=# select '{90000,0.00009,0}'::line ?|| '{1000,0.000001,0}'::line;
> > ?column?
> > ----------
> > t
> > (1 row)

This kind of thing also appears with exact comparison, in an even
less predictable way. That is why we have the EPSILON: so that the
errors stay reasonable at least for lon/lat geometries. The values
above are far outside the range the EPSILON was designed for in
the first place. But exact comparisons would be simply unworkable.

> Another problem is lack of hash and btree operator classes.  In my
> experience, the point datatype is by far most used one.  People often
> try to use it on DISTINCT, GROUP BY, or ORDER BY clauses and complain
> when it doesn't work.  There are many complaints like this on the
> archives.  If we get rid of the epsilon, we can easily add those
> operator classes.

That's the fate of FP numbers. Btree is usable for such data,
since it is constructed using inequality operators, but hash is
almost unusable without rounding and normalization because of the
nature of floating-point numbers. Range scans on btree will work
fine, but equality doesn't work even with a btree index, for the
same reason as below.

DISTINCT and GROUP BY are simply impossible without somehow
defining the equality of two points. Bare FP numbers don't fit,
since in general they cannot match each other exactly, and should
instead match at some resolution such as 0.000001, 0.1, 1, or 1000
according to the objective. I suspect those who complain about
this each have a different resolution in mind.

Again, FP numbers generally cannot be expected to match each other
exactly. You have to avoid relying on that.


Kyotaro Horiguchi
NTT Open Source Software Center
