Igor,
The stats estimation for PostGIS has never been the greatest. It's more of a
limitation of PostgreSQL, as I understand it.
It's been getting better over the years, though. Even in the best-case
scenario, only the bounding-box estimation happens, and Paul may be better able
to answer what happens with functions such as ST_Contains (which use the
function instrumentation).
So a better test to confirm your stats are working correctly is an && test,
which I have below.
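If you want to poke at the gathered stats directly, I believe PostGIS ships a
couple of debug helpers for exactly this (_postgis_stats and
_postgis_selectivity; check that your install has them, since they are
undocumented internals). A quick sketch against your test table:

SELECT _postgis_stats('test', 'p');
SELECT _postgis_selectivity('test', 'p',
    ST_GeomFromText('POLYGON((0 0,3 0,3 3,0 3,0 0))'));

The first dumps the histogram stored for the column; the second reports the
selectivity the planner would compute for that geometry.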
I’m running PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit
POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1"
SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3,
released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1"
WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3
3,0 3,0 0))') && p;
which gives me the expected answer of:
Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024
rows=4 loops=1)
Filter:
('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry
&& p)
Rows Removed by Filter: 1
Planning Time: 0.131 ms
Execution Time: 0.046 ms
And 4 rows are indeed returned for me.
explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 0,52
2,50 2,50 0))') && p;
Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024
rows=1 loops=1)
Filter:
('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry
&& p)
Rows Removed by Filter: 4
Planning Time: 0.124 ms
Execution Time: 0.047 ms
I confirm that I get the same answer as you below when doing
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.026..0.038
rows=4 loops=1)
Filter:
st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry,
p)
Rows Removed by Filter: 1
Planning Time: 0.132 ms
Execution Time: 0.061 ms
So it does seem odd to me that ST_Contains would give a higher cost estimate
than the && operation. That could be a bug in the function instrumentation of
ST_Contains and ST_Intersects.
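For what it's worth, you can see which estimator the planner consults in each
case from the standard PostgreSQL catalogs (a sketch; oprrest is the operator's
restriction estimator, and prosupport is the support function PostGIS attaches
to functions like ST_Contains on PostgreSQL 12+):

SELECT oprname, oprrest::regproc
FROM pg_operator
WHERE oprname = '&&' AND oprleft = 'geometry'::regtype;

SELECT proname, prosupport::regproc
FROM pg_proc
WHERE proname = 'st_contains';

If the two forms report different estimators, that would narrow down where the
estimate diverges.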
Thanks,
Regina
From: postgis-users [mailto:[email protected]] On Behalf Of
Igor ALBUQUERQUE SILVA
Sent: Thursday, December 1, 2022 4:37 AM
To: [email protected]
Subject: [postgis-users] Row estimations
Hello everyone,
I have a question regarding the row estimations/gist indexes. Here's a minimal
reproduction of it:
create table test(p geometry(point));
insert into test(p) values (st_makepoint(1,1));
insert into test(p) values (st_makepoint(1,2));
insert into test(p) values (st_makepoint(2,1));
insert into test(p) values (st_makepoint(2,2));
insert into test(p) values (st_makepoint(51,1));
analyze test;
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
The two queries get the same cost/row estimate of 1 row. This is the EXPLAIN
ANALYZE of the first query:
Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022
rows=4 loops=1)
Filter:
st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry,
p)
Rows Removed by Filter: 1
Planning Time: 0.072 ms
Execution Time: 0.035 ms
(5 rows)
What I was expecting is for the first query to estimate 4 rows and the second
to estimate 1, like what I get if I try the same thing using integers.
create table test(x integer, y integer);
insert into test(x, y) values (0, 0);
insert into test(x, y) values (0, 1);
insert into test(x, y) values (1, 0);
insert into test(x, y) values (1, 1);
insert into test(x, y) values (50, 0);
analyze test;
explain analyze select * from test where x between 0 and 1 and y between 0 and
1;
explain analyze select * from test where x between 50 and 51 and y between 0
and 1;
My question is: is this expected behaviour? I actually have a much larger table
with a gist index where I found this occurring, and it causes the planner to
make bad decisions: every query I run gets the same estimate, and whenever that
estimate is very wrong, the planner does not make the optimal decision when it
has to compare against other indexes' costs in a more complicated query.
I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1)
on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110,
64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" GEOS="3.9.0-CAPI-1.16.2"
PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0
(Internal)" TOPOLOGY
Best regards,
Igor
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users