So, this is rather hard to explain without pictures, but you have to first of 
all recognize that the selectivity estimator is an "estimator": you cannot 
construct synthetic situations and expect the numbers to match exactly. You 
can only expect them to be better or worse approximations.

In particular, questions like "how many points are selected by this box" are 
a little fraught.

First, understand that underlying the spatial stats system is (in the 2d 
case) a simple 2d grid. The database draws a sample and fills in the count of 
features lying in each grid cell. Simple, yes? Except that spatial data sets 
can get quite big. There's definitely a point at which you don't want to make 
the grid any larger, for fear of making working with it too expensive. Right 
now we tend towards grids with about 20K cells at the highest end, and very few 
at the lowest end.
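
If you want to see what the estimator is actually working with, recent PostGIS 
builds ship a couple of debug helpers, _postgis_stats and _postgis_selectivity, 
which dump the histogram and the computed selectivity directly. Something like 
the following should do it against your five-row test table (I'm going from 
memory on the exact signatures, so check your version if it errors out):

-- dump the n-d histogram gathered by ANALYZE
SELECT _postgis_stats('test', 'p');
-- ask the estimator for its selectivity for the first query box
SELECT _postgis_selectivity('test', 'p',
    ST_GeomFromText('POLYGON((0 0,3 0,3 3,0 3,0 0))'));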

As the code currently stands, it's possible for very small input tables to 
generate grids of 1x1 cell... which is what your case generates. In that case, 
the selectivity ends up being the proportion of the cell that the query box 
overlaps. So you can get (as you did) a selectivity of about 5% when the query 
box neatly covered 80% of the input rows.
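
To make that concrete with the numbers from your test table (x running from 1 
to 51, y from 1 to 2), the back-of-the-envelope version of the 1x1-cell 
estimate looks like this. The arithmetic is mine, not a trace of the actual C 
code:

-- one histogram cell spanning the sample extent (x in [1,51], y in [1,2]);
-- the first query box runs from (0 0) to (3 3); selectivity is just the
-- fraction of that single cell the box overlaps
SELECT ((least(3, 51) - greatest(0, 1)) / (51.0 - 1.0))  -- x overlap: 2/50
     * ((least(3, 2)  - greatest(0, 1)) / (2.0 - 1.0))   -- y overlap: 1/1
       AS approx_selectivity;                             -- ~0.04, about 5%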

To a first approximation, this can be improved by tweaking the analysis code to 
increase the number of cells in the histogram for small tables. However, the 
selectivity calculations are still necessarily proportional to the area of cell 
overlapped. In your example data, for, say, a 2x2 grid, which does capture the 
outlying point in its own cell, the cell boundary will be halfway between the 
extremes, so your first small query box will only pick up a selectivity of 
0.2. Better than 0.05! But still not really right.
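
For the multi-cell case, the shape of the calculation is a sum over cells of 
"share of sampled features in the cell" times "fraction of the cell the query 
box overlaps". Here is a toy SQL rendering of that idea, with completely 
made-up counts and overlap fractions just to show the arithmetic; it is not a 
transcript of what the C estimator does:

-- made-up histogram cells: how many sampled features landed in each cell,
-- and what fraction of the cell's x/y extent a hypothetical query box covers
WITH cells(feature_count, x_overlap, y_overlap) AS (
    VALUES (10, 1.0, 1.0),
           (30, 0.5, 1.0),
           (60, 0.0, 0.0)
)
SELECT sum(feature_count * x_overlap * y_overlap)
       / sum(feature_count) AS approx_selectivity   -- (10 + 15) / 100 = 0.25
FROM cells;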

The reasoning behind having the shrinking dimensions may strike you as odd: 
why not just a minimum 100x100 grid? Well, that comes down to our (my) 
philosophy of trying to extract as much analytical power out of each cell as 
possible. Basically, your data is quite variable on the X axis and not really 
variable on the Y axis... should we allocate as many cells in X in our stats 
grid as we do in Y? If you answer "no", then you're off down the rabbit hole 
with me: for each dimension, statistically figure out how much variability 
there is in coverage, and then parcel out grid cells so that more of the 
budget goes to the most variable dimensions and less to the homogeneous ones. 
This means that, in the limit, with a data set that is quite variable in X and 
quite homogeneous in Y, you could end up with a grid of 20,000 cells along X 
and a single cell along Y: lots of resolution in the axis we care about and 
none in the homogeneous axis.
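
As a rough illustration of the parcelling-out idea (and only that: the 
variability scores and the power-law allocation rule below are stand-ins I 
made up for this email, not the real code in the analyzer):

-- toy allocation: give each dimension a share of the cell budget in
-- proportion to its variability score, using total^share so the per-dimension
-- cell counts multiply back out to (roughly) the total budget of 20K
WITH dims(dim, variability) AS (
    VALUES ('x', 20.0::float8),   -- made-up scores: X varies a lot...
           ('y',  0.2::float8)    -- ...Y hardly at all
)
SELECT dim,
       greatest(1, round(power(20000,
                   variability / sum(variability) OVER ())))::int AS cells
FROM dims;
-- with scores this lopsided, X gets most of the budget (~18K cells) and Y
-- gets 1, which is the limiting case described above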

Very clever, but lots of moving parts, and in the limit, as the number of 
incoming records goes down, it's not going to yield estimates that are as close 
to "reality" as we might like. In return, though, it extracts more precise 
estimates when the number of rows is very, very large.

At least in theory. One possibility might be to push up the current maximum 
number of cells, and use some of that budget to put a higher floor on the 
minimum number of cells per dimension, so that the minimum isn't 1, as it is 
currently, but something more like 10, to cut down on the kind of divergence 
you're seeing in your small-table case. I have some tweaks in hand that do that 
already, but now that I'm into the code I'm trying to dream up ways to cut that 
divergence back while still maintaining a certain measure of dynamism.

How much operational difference does the mis-estimate create? For a small 
table, I really wonder. Because the number of tuples will always be small, it's 
not like you're likely to be dropped into an index scan by accident, even if 
the selectivity is coming back too low.
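
If you want to check that on your five-row table directly, slap a GiST index 
on it and see whether the plan actually moves; the index name here is just one 
I made up:

CREATE INDEX test_p_gix ON test USING gist (p);
ANALYZE test;
EXPLAIN ANALYZE
SELECT * FROM test
WHERE ST_Contains(ST_GeomFromText('POLYGON((0 0,3 0,3 3,0 3,0 0))'), p);
-- with five rows in a single page, a seq scan is usually still the cheapest
-- plan regardless of how far off the row estimate is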

I am much more interested in cases where larger numbers of rows generate 
mis-estimates. I assume my code might also be vulnerable to tables with a lot 
of duplication, and might behave differently for tables of points versus tables 
of polygons or lines.
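
If anyone wants to cook up that kind of test case, something along these lines 
would do; the table name, the envelope, and the duplication pattern are just 
placeholders:

-- a million points with heavy duplication (about 1000 distinct locations),
-- to see how the estimate holds up when the row count is big and the data
-- is repetitive; compare the planner's rows= against the actual count
CREATE TABLE big_test AS
SELECT ST_MakePoint((i % 100)::float8, ((i / 100) % 10)::float8)::geometry(Point) AS p
FROM generate_series(1, 1000000) AS i;
CREATE INDEX ON big_test USING gist (p);
ANALYZE big_test;
EXPLAIN ANALYZE
SELECT * FROM big_test WHERE p && ST_MakeEnvelope(0, 0, 10, 5);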

ATB,
P

> On Dec 6, 2022, at 6:06 PM, Regina Obe <[email protected]> wrote:
> 
> Correction.  I was mistaken, I tested the wrong example for rows=1.
>  
> My ST_Contains returns the same answer as && 
>  
> explain analyze select * from test where 
> ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
>  
> Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> time=0.025..0.027 rows=1 loops=1)
>   Filter: 
> st_contains('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry,
>  p)
>   Rows Removed by Filter: 4
> Planning Time: 0.121 ms
> Execution Time: 0.047 ms
>  
> So the fact that works for me and not you might be just dumb luck that I ran 
> the && first and the stats are cached and using that.
>  
> The difference in answer might also be if you modified your 
> default_statistics_target.
>  
> SHOW default_statistics_target;
>  
> Mine shows 100.
>  
> Thanks,
> Regina
>  
>  
> From: Regina Obe [mailto:[email protected]] 
> Sent: Tuesday, December 6, 2022 8:59 PM
> To: 'PostGIS Users Discussion' <[email protected]>
> Subject: RE: [postgis-users] Row estimations
>  
> Igor,
>  
> The stats estimation for PostGIS has always been not the greatest.  It’s more 
> of a limitation with PostgreSQL as I understand it.
> It’s been getting better over the years though.  Even in the best case 
> scenario only the bounding box estimation would happen and Paul may be better 
> able to answer what happens with the functions such as ST_Contains (which use 
> the Function instrumentation).
>  
> So a better test to confirm your stats are working correctly is an && test 
> which I have below.
>  
> I’m running  PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit 
> POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1" 
> SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3, 
> released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1" 
> WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
>  
> explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3 
> 3,0 3,0 0))') &&  p;
>  
>  
> which give me the expected answer of  – 
>  
> Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024 
> rows=4 loops=1)
>   Filter: 
> ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry
>  && p)
>   Rows Removed by Filter: 1
> Planning Time: 0.131 ms
> Execution Time: 0.046 ms
>  
> And 4 rows are indeed returned for me.
>  
> explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 
> 0,52 2,50 2,50 0))') && p;
>  
> Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024 
> rows=1 loops=1)
>   Filter: 
> ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry
>  && p)
>   Rows Removed by Filter: 4
> Planning Time: 0.124 ms
> Execution Time: 0.047 ms
>  
> I confirm that I get the same answer of below when doing
>  
> explain analyze select * from test where 
> ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
>  
>  
> Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> time=0.026..0.038 rows=4 loops=1)
>   Filter: 
> st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry,
>  p)
>   Rows Removed by Filter: 1
> Planning Time: 0.132 ms
> Execution Time: 0.061 ms
>  
>  
> So it does seem odd to me that ST_Contains would give a higher estimate than 
> the && operation.  That could be a bug in the function instrumentation of  
> ST_Contains and ST_Intersects.
>  
>  
> Thanks,
> Regina
>  
> From: postgis-users [mailto:[email protected]] On Behalf 
> Of Igor ALBUQUERQUE SILVA
> Sent: Thursday, December 1, 2022 4:37 AM
> To: [email protected]
> Subject: [postgis-users] Row estimations
>  
> Hello everyone,
>  
> I have a question regarding the row estimations/gist indexes. Here's a 
> minimal reproduction of it:
>  
> create table test(p geometry(point));
> insert into test(p) values (st_makepoint(1,1));
> insert into test(p) values (st_makepoint(1,2));
> insert into test(p) values (st_makepoint(2,1));
> insert into test(p) values (st_makepoint(2,2));
> insert into test(p) values (st_makepoint(51,1));
> analyze test;
> explain analyze select * from test where 
> ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> explain analyze select * from test where 
> ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
>  
> The two queries get the same cost/row estimation, of 1 row. This is the 
> EXPLAIN ANALYZE of the first query:
>  
> Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> time=0.015..0.022 rows=4 loops=1)
>    Filter: 
> st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry,
>  p)
>    Rows Removed by Filter: 1
>  Planning Time: 0.072 ms
>  Execution Time: 0.035 ms
> (5 rows)
>  
> What I was expecting is the first query to estimate 4 rows and the second to 
> estimate 1, like what I get If I try the same thing using integers.
>  
> create table test(x integer, y integer);
> insert into test(x, y) values (0, 0);
> insert into test(x, y) values (0, 1);
> insert into test(x, y) values (1, 0);
> insert into test(x, y) values (1, 1);
> insert into test(x, y) values (50, 0);
> analyze test;
> explain analyze select * from test where x between 0 and 1 and y between 0 
> and 1;
> explain analyze select * from test where x between 50 and 51 and y between 0 
> and 1;
>  
> My question is: is this expected behaviour? I actually have a much larger 
> table with a gist index where I found this occurring, and this causes the 
> planner to make bad decisions: every query that I do will have the same 
> estimation, and whenever this estimation is very wrong, the planner does not 
> take the optimal decision when it has to compare with other indexes costs in 
> a more complicated query.
>  
> I'm using the official docker image, PostgreSQL 15.1 (Debian 
> 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 
> 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" 
> GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" 
> LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
>  
> Best regards,
> Igor
> _______________________________________________
> postgis-users mailing list
> [email protected]
> https://lists.osgeo.org/mailman/listinfo/postgis-users
