Re: [postgis-users] Points and Row Count Estimates

Paul Ramsey Wed, 01 Mar 2023 08:44:56 -0800


> On Mar 1, 2023, at 7:23 AM, Gareth Bell <gareth.b...@outlook.com> wrote:
> 
> Hi postgis-users,
> 
> Please may I have some guidance. I have been looking in to bad row count 
> estimates for a simple query, similar to the following:
>    SELECT area_id FROM area WHERE ST_INTERSECTS(area.polygon, ST_MakePoint(x, 
> y))
> 
> I have found that the row count estimate is always '1', even if the point 
> intersects multiple geometries and the actual row count is in the thousands. 
> Is this expected behaviour?
> 
> To troubleshoot, I have tried:
>    1. Manually running `VACUUM ANALYZE`
>    2. Ensuring that stats are returned by `__postgis_stats`
>    3. Trying different but similar functions such as `ST_Contains`
> 
> I have also found that the following selectivity function returns `0` for any 
> point geometry:
>    `__postgis_selectivity('area', 'polygon', ST_MakePoint(x, y))`


Yes, this is “expected” (the code is operating as written) though perhaps not 
correct. In general estimates will be estimates. The culpable line of code is 

                /* Add the pro-rated count for this cell to the overall total */
                total_count += cell_count * ratio;

Since we are usually checking things with area as our query keys, pro-rating 
coverage of our grid makes sense. 

Most cases of point-on-poly queries, the number of polys under a given point is 
one. So the selectivity of a single point is 1/N where N is the table size. 
Which converges nicely on zero.

You have some kind of fun oddball case with a lot of (presumably variable) 
polygon overlap such that this simplification doesn’t make sense for you. 
Unfortunately we don’t have any geometry typology to play with to special case 
points as the selectivity key. We just have the bounds. And we cannot 
reasonably treat a “zero size box” as a “point”, since a “zero size polygon” 
will presumably have zero selectivity and will also have a “zero size box". 

The “trouble” is that the grid is constructed quite deliberately knowing that 
we will be dong pro-rata calculations against it. We reduce the cell density in 
dimensions with low variability, and use pro-rating to get the correct values: 
if the variable is uniform and you cover half the variable range, you get half 
the expected results back. Once you stop pro-rating, by, for example, assigning 
all the cell’s contents to a “zero size box” you get a completely different 
flavour of wrong results.

We are trying to squeeze a lot of information out of a relatively small data 
structure. We get 30000 samples in an ANALYZE. That’s 30000 boxes, or 30000 * 
sizeof(float) * sizeof(float) bytes… 480K. That’s the raw sample! If we 
quantized that into a 1000x1000 uniform grid, we’d end up with 1Mb of “sample” 
grid data, our data structure would be larger than our inputs! 

So to try and squeeze more out of the grid, we make it non-uniform and apply 
pro-rata scaling in the selectivity estimation, but then we get the issue 
you’ve identified for zero dimension query inputs. 

If we had explicit column types (point, line, polygon) we could actually have 
different estimators for each, albeit at immense extra complexity. 

> After reading through 
> https://www.crunchydata.com/blog/indexes-selectivity-and-statistics, I think 
> a selectivity of `0` for a point might be the intended behaviour, but I'm not 
> sure -- based on the blog post (and reading through the PostGIS source), 
> spatial selectivity is the % of geometries covered by the search geometry. 
> When the search geometry is a point (which has no dimensions), none of the 
> geometry will be covered, giving a selectivity of `0`.

So, yes and no. It’s not really intended but it’s also not avoidable under the 
current constraints of the design. The “intent” would be to always have the 
selectivity estimate exactly match the actual returned rows, but that’s 
obviously not achievable. The question is where we get things wrong and by how 
much.

ATB,

P


> 
> This is where I've got to on the investigation so far. Could I have some 
> guidance where to look next please? In particular:
>    1. Is always returning a specificity of `0` for a point geometry expected 
> behaviour?
>    2. Is there a way to get more accurate row estimates for point-in-poly 
> lookups such as `ST_INTERSECTS(area.polygon, ST_MakePoint(x, y)`?
> 
> Many thanks in advance,
> Gareth
> _______________________________________________
> postgis-users mailing list
> postgis-users@lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

Re: [postgis-users] Points and Row Count Estimates

Reply via email to