> So I was running this query for 866000 s (10 days) before I decided to 
> kill it:


> One potential thing I've realized is that a few of the geometries in tazjan2 
> are multipolygons, not single polygons.  But it's only a few.  There are a 
> few very large and complex polygons in lancover_polygons_snap, but again, 
> most of the 998031 are simple small polygons; about half would be 
> ST_CoveredBy the polygons in tazjan2, and most of the rest would only overlap 
> two or three of the polygons in tazjan2.


A common offender: the multipolygons.

You can get an immediate performance improvement by dumping 
(ST_Dump(wkb_geometry)).geom into a new table, with a new serial id but 
retaining the original poly_id, so you don't lose the multipolygon provenance 
of each polygon.
Complex operations on the multipolygon table, like ST_Intersection(), are 
extremely slow, because the query has to access and compute the operation on 
each of the geometry elements contained within the multi.
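
A minimal sketch of that dump step, assuming a source table tazjan2 with 
columns poly_id and wkb_geometry (the output table and index names here are 
illustrative):

CREATE TABLE tazjan2_dumped AS
SELECT row_number() OVER () AS gid,   -- new serial-style id
       poly_id,                       -- retained, so provenance isn't lost
       geom AS wkb_geometry
FROM (SELECT poly_id, (ST_Dump(wkb_geometry)).geom AS geom
      FROM tazjan2) AS dumped;

CREATE INDEX tazjan2_dumped_geom_idx
    ON tazjan2_dumped USING GIST (wkb_geometry);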

Another massive performance gain can be achieved by “tiling” your dumped 
polygon table - which has the effect of breaking your geometries into smaller 
features, which you can then union back together again after running 
ST_Intersection on your vector tiles.  Don’t try to tile your multipolygon 
table - you’ll still be waiting days.

Take the coastline of Australia - a single multipolygon:

SELECT Count(ogc_fid) AS num_rows,
       SUM(ST_NumGeometries(wkb_geometry)) AS num_geoms,
       SUM(ST_NPoints(wkb_geometry)) AS num_points
FROM abs_aus11;

 num_rows | num_geoms | num_points 
----------+-----------+------------
        1 |      6718 |    1622598

If you need to query the intersection of a polygon with the land, and with the 
water, it's near impossible - many, many days.

Hence we need to take a different approach.

The first thing you can do is to dump the multipolygon table.
Then create a regular vector grid.
Then tile both your polygon tables using the vector grid.
Put indexes on the resultant tiled tables.
Then run your ST_Intersects.
The result is many more geometries and points, but now we get the benefit of 
geometry indexes.  A sketch of these steps follows below.
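
Here is a minimal sketch of those steps (the dump is the same pattern as the 
tazjan2 example above).  The table names, 0.5-degree tile size, and SRID 4326 
are all illustrative, and ST_SquareGrid() needs PostGIS 3.1+; on older 
versions you can build the grid yourself with generate_series() and 
ST_MakeEnvelope().

-- Create a regular vector grid covering the dumped table's extent
CREATE TABLE aus_grid AS
SELECT row_number() OVER () AS tile_id,
       geom AS wkb_geometry
FROM ST_SquareGrid(
       0.5,  -- tile size, in the units of your SRID
       (SELECT ST_SetSRID(ST_Extent(wkb_geometry)::geometry, 4326)  -- match your data's SRID
        FROM abs_aus11_dumped));

CREATE INDEX ON aus_grid USING GIST (wkb_geometry);

-- Tile the dumped polygons: clip each one to the grid cells it touches
CREATE TABLE abs_aus11_tiled AS
SELECT row_number() OVER () AS tid,
       g.tile_id,
       d.poly_id,
       ST_Intersection(d.wkb_geometry, g.wkb_geometry) AS wkb_geometry
FROM abs_aus11_dumped d
JOIN aus_grid g
  ON ST_Intersects(d.wkb_geometry, g.wkb_geometry);

-- Index the resultant tiled table
CREATE INDEX abs_aus11_tiled_geom_idx
    ON abs_aus11_tiled USING GIST (wkb_geometry);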

As a result of tiling the Australian coastline, we now get:

SELECT Count(tid) AS num_rows,
       SUM(ST_NumGeometries(wkb_geometry)) AS num_geoms,
       SUM(ST_NPoints(wkb_geometry)) AS num_points
FROM abs_aus11_tiled;

 num_rows | num_geoms | num_points 
----------+-----------+------------
    17222 |     29020 |    6513770

Each geometry also has a tile_id, and the original poly_id.  The tile_ids are 
really useful for subsetting your queries, and for use as an additional join 
condition.  At any time you can rapidly rebuild your original geometries by 
running ST_Union() with GROUP BY poly_id.
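
For example, rebuilding from the tiled coastline table above:

SELECT poly_id,
       ST_Union(wkb_geometry) AS wkb_geometry
FROM abs_aus11_tiled
GROUP BY poly_id;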

Using the approach of dumping and tiling, queries that once took days now take 
minutes at most.  And as Remi mentioned, you can parallelise the process.  The 
concept of vector tiling in PostGIS is analogous to Map Reduce and assigning 
key/value pairs.  It's about chunking your data, doing lots of fast operations 
on the smaller bits, and then consolidating your outputs.  You don't 
necessarily have to write a SQL function to do the parallelisation.  My 
preference is Bash and GNU Parallel, because you can write vanilla SQL and 
execute it via psql in parallel.  For me it's now about the speed at which I 
can write the SQL.
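
A rough sketch of that pattern, splitting the work by tile_id and running one 
psql session per chunk.  The database name, the second tiled table 
(landcover_tiled), the output table names, and the eight-way split are all 
illustrative:

seq 0 7 | parallel -j8 '
  psql -d gisdb -c "
    CREATE TABLE land_part_{} AS
    SELECT t.tile_id, t.poly_id,
           ST_Intersection(t.wkb_geometry, l.wkb_geometry) AS wkb_geometry
    FROM abs_aus11_tiled t
    JOIN landcover_tiled l
      ON t.tile_id = l.tile_id
     AND ST_Intersects(t.wkb_geometry, l.wkb_geometry)
    WHERE t.tile_id % 8 = {};"
'

Each job writes its own output table; when they all finish, consolidate the 
parts with a UNION ALL view or a final CREATE TABLE AS.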

So we’re now starting to apply Big Data concepts within PostGIS… Holy smoke… 
and you can apply the same concepts to a wide range of PostGIS operations - 
scale up, scale out…

I will write up a tutorial explaining the benefits of vector tiling in PostGIS, 
with examples and parallelisation code patterns.  If not today, hopefully over 
the next week.  

Mark

