> So I'd been running this query for 866000 s (10 days) before I decided to
> kill it:
> One potentially relevant thing I've realized is that a few of the geometries
> in tazjan2 are multipolygons, not single polygons. But it's only a few.
> There are a few very large and complex polygons in lancover_polygons_snap,
> but again, most of the 998031 are simple small polygons; about half would be
> ST_CoveredBy the polygons in tazjan2, and most of the rest would only
> overlap two or three of the polygons in tazjan2.
A common offender - the multipolygons.
You can get an immediate performance improvement by dumping
(ST_Dump(wkb_geometry)).geom into a new table with a new serial id, while
retaining the original poly_id so you don’t lose the multipolygon provenance
of each polygon.
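Something like this minimal sketch, assuming a source table named
lancover_polygons_snap with columns poly_id and wkb_geometry (substitute your
own names, and add a geometry type cast if you want a typed column):

CREATE TABLE lancover_polygons_dump AS
SELECT row_number() OVER () AS gid,              -- new serial id
       poly_id,                                  -- keep multipolygon provenance
       (ST_Dump(wkb_geometry)).geom AS wkb_geometry
FROM lancover_polygons_snap;

CREATE INDEX ON lancover_polygons_dump USING GIST (wkb_geometry);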
Complex operations on the multipolygon table, like ST_Intersection(), are
extremely slow, because the query has to access and compute the operation on
every geometry element contained within the multi.
Another massive performance gain can be achieved by “tiling” your dumped
polygon table, which has the effect of breaking your geometries into smaller
features that you can “union” back together again after running
ST_Intersection on your vector tiles. Don’t try to tile your multipolygon
table - you’ll still be waiting days.
Take the coastline of Australia - a single multipolygon:
SELECT Count(ogc_fid) AS num_rows,
       SUM(ST_NumGeometries(wkb_geometry)) AS num_geoms,
       SUM(ST_NPoints(wkb_geometry)) AS num_points
FROM abs_aus11;

 num_rows | num_geoms | num_points
----------+-----------+------------
        1 |      6718 |    1622598
If you need to query the intersection of a polygon with the land, and with the
water, it’s near impossible - many, many days.
Hence we need to take a different approach.
The first thing you can do is to dump the multipolygon table.
Then create a regular vector grid.
Then tile both of your polygon tables using the vector grid.
Put spatial indexes on the resultant tiled tables.
Then run your ST_Intersects. A sketch of these steps follows below.
The result is many more geometries, and points, but now we get the benefit of
geometry indexes.
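A minimal sketch of the grid and tiling steps. The 0.5 degree grid size, SRID
4326, and the table names aus_grid and abs_aus11_dump (a dump of abs_aus11
built the same way as the sketch above) are all assumptions - tune them to
your own data:

-- create a regular vector grid covering the extent of the data
CREATE TABLE aus_grid AS
SELECT row_number() OVER () AS tile_id,
       ST_MakeEnvelope(110 + ix * 0.5, -45 + iy * 0.5,
                       110 + (ix + 1) * 0.5, -45 + (iy + 1) * 0.5,
                       4326) AS wkb_geometry
FROM generate_series(0, 89) AS ix,
     generate_series(0, 71) AS iy;

-- tile the dumped polygon table: clip each polygon to every grid cell it
-- intersects, keeping tile_id and the original poly_id
CREATE TABLE abs_aus11_tiled AS
SELECT row_number() OVER () AS tid,
       g.tile_id,
       d.poly_id,
       ST_Intersection(d.wkb_geometry, g.wkb_geometry) AS wkb_geometry
FROM abs_aus11_dump d
JOIN aus_grid g
  ON ST_Intersects(d.wkb_geometry, g.wkb_geometry);

-- index the tiled table
CREATE INDEX ON abs_aus11_tiled USING GIST (wkb_geometry);
CREATE INDEX ON abs_aus11_tiled (tile_id);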
As a result of tiling the Australian coastline, we now get:

SELECT Count(tid) AS num_rows,
       SUM(ST_NumGeometries(wkb_geometry)) AS num_geoms,
       SUM(ST_NPoints(wkb_geometry)) AS num_points
FROM abs_aus11_tiled;

 num_rows | num_geoms | num_points
----------+-----------+------------
    17222 |     29020 |    6513770
Each geometry also has a tile_id, and the original poly_id. The tile_ids are
really useful for subsetting your queries, and for use as an additional join
condition. At any time you can rapidly rebuild your original geometries with
ST_Union() ... GROUP BY poly_id, as in the sketch below.
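For example, against the tiled table above:

SELECT poly_id,
       ST_Union(wkb_geometry) AS wkb_geometry
FROM abs_aus11_tiled
GROUP BY poly_id;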
Using the approach of dumping and tiling, queries that once took days now take
minutes at most. And as Remi mentioned, you can parallelise the process. The
concept of vector tiling in PostGIS is analogous to MapReduce and assigning
key-value pairs: it’s about chunking your data, doing lots of fast operations
on the smaller bits, and then consolidating your outputs. You don’t
necessarily have to write a SQL function to do the parallelisation. My
preference is Bash and GNU Parallel, because you can write vanilla SQL and
execute it via psql in parallel. For me it’s now about the speed at which I
can write the SQL.
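A minimal sketch of that pattern. The database name (mydb), the table names
(landcover_tiled, taz_tiled, land_x_taz), and the eight-way split are all
assumptions - the point is that both inputs were tiled with the same grid, so
tile_id works as a cheap extra join condition:

# write one vanilla-SQL file per residue class of tile_id...
for i in $(seq 0 7); do
  cat > chunk_$i.sql <<EOF
INSERT INTO land_x_taz
SELECT t.poly_id, z.poly_id AS taz_poly_id,
       ST_Intersection(t.wkb_geometry, z.wkb_geometry) AS wkb_geometry
FROM landcover_tiled t
JOIN taz_tiled z
  ON t.tile_id = z.tile_id
 AND ST_Intersects(t.wkb_geometry, z.wkb_geometry)
WHERE t.tile_id % 8 = $i;
EOF
done
# ...then execute the chunks eight at a time
ls chunk_*.sql | parallel -j 8 psql -d mydb -f {}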
So we’re now starting to apply Big Data concepts within PostGIS…. Holy
smoke… and you can apply the same concepts to a wide range of PostGIS
operations - scale up, scale out….
I will write up a tutorial explaining the benefits of vector tiling in PostGIS,
with examples and parallelisation code patterns. If not today, hopefully over
the next week.
Mark