Igor,
The stats estimation for PostGIS has never been the greatest. It's more of a
limitation of PostgreSQL, as I understand it.
It's been getting better over the years, though. Even in the best-case
scenario, only the bounding-box estimation happens, and Paul may be better able
to answer what happens with functions such as ST_Contains (which use the
function instrumentation).
So a better test to confirm your stats are working correctly is an && test,
which I have below.
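If you want to poke at the gathered stats directly, I believe PostGIS ships a
couple of debug helpers for exactly this (_postgis_stats and
_postgis_selectivity; check that your install has them, since they are
undocumented internals). A quick sketch against your test table:

SELECT _postgis_stats('test', 'p');
SELECT _postgis_selectivity('test', 'p',
    ST_GeomFromText('POLYGON((0 0,3 0,3 3,0 3,0 0))'));

The first dumps the histogram stored for the column; the second reports the
selectivity the planner would compute for that geometry.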
I’m running PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit
POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1"
SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3,
released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1"
WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3
3,0 3,0 0))') && p;
which gives me the expected answer of:
Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024
rows=4 loops=1)
Filter:
('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry
&& p)
Rows Removed by Filter: 1
Planning Time: 0.131 ms
Execution Time: 0.046 ms
And 4 rows are indeed returned for me.
explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 0,52
2,50 2,50 0))') && p;
Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024
rows=1 loops=1)
Filter:
('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry
&& p)
Rows Removed by Filter: 4
Planning Time: 0.124 ms
Execution Time: 0.047 ms
I confirm that I get the same answer as you below when doing
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.026..0.038
rows=4 loops=1)
Filter:
st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry,
p)
Rows Removed by Filter: 1
Planning Time: 0.132 ms
Execution Time: 0.061 ms
So it does seem odd to me that ST_Contains would give a higher cost estimate
than the && operation. That could be a bug in the function instrumentation of
ST_Contains and ST_Intersects.
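For what it's worth, you can see which estimator the planner consults in each
case from the standard PostgreSQL catalogs (a sketch; oprrest is the operator's
restriction estimator, and prosupport is the support function PostGIS attaches
to functions like ST_Contains on PostgreSQL 12+):

SELECT oprname, oprrest::regproc
FROM pg_operator
WHERE oprname = '&&' AND oprleft = 'geometry'::regtype;

SELECT proname, prosupport::regproc
FROM pg_proc
WHERE proname = 'st_contains';

If the two forms report different estimators, that would narrow down where the
estimate diverges.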
Thanks,
Regina
From: postgis-users [mailto:[email protected]] On Behalf Of
Igor ALBUQUERQUE SILVA
Sent: Thursday, December 1, 2022 4:37 AM
To: [email protected]
Subject: [postgis-users] Row estimations
Hello everyone,
I have a question regarding the row estimations/gist indexes. Here's a minimal
reproduction of it:
create table test(p geometry(point));
insert into test(p) values (st_makepoint(1,1));
insert into test(p) values (st_makepoint(1,2));
insert into test(p) values (st_makepoint(2,1));
insert into test(p) values (st_makepoint(2,2));
insert into test(p) values (st_makepoint(51,1));
analyze test;
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
The two queries get the same cost/row estimate of 1 row. This is the EXPLAIN
ANALYZE of the first query:
Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022
rows=4 loops=1)
Filter:
st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry,
p)
Rows Removed by Filter: 1
Planning Time: 0.072 ms
Execution Time: 0.035 ms
(5 rows)
What I was expecting is for the first query to estimate 4 rows and the second
to estimate 1, like what I get if I try the same thing using integers.
create table test(x integer, y integer);
insert into test(x, y) values (0, 0);
insert into test(x, y) values (0, 1);
insert into test(x, y) values (1, 0);
insert into test(x, y) values (1, 1);
insert into test(x, y) values (50, 0);
analyze test;
explain analyze select * from test where x between 0 and 1 and y between 0 and
1;
explain analyze select * from test where x between 50 and 51 and y between 0
and 1;
My question is: is this expected behaviour? I actually have a much larger table
with a gist index where I found this occurring, and it causes the planner to
make bad decisions: every query I run gets the same estimate, and whenever that
estimate is very wrong, the planner does not make the optimal decision when it
has to compare against other indexes' costs in a more complicated query.
I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1)
on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110,
64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" GEOS="3.9.0-CAPI-1.16.2"
PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0
(Internal)" TOPOLOGY
Best regards,
Igor
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users