Hey, for the following I make the hypothesis that 1. for you trip A->B should be separated from trip B->A. 2. od1.taz.geom are non overlaping polygon (ie your zone are non overlaping) 3. whatever startloc, endloc, there exist a zone overlapping it.
Now you querry look like this --compute the origin_destination matrix DROP TABLE IF EXISTS origin_destination_matrix ; CREATE TABLE origin_destination_matrix AS SELECT t.trip_id, zone1 AS starting_zone, zone2 as ending_zone FROM od1.trip_v1 as t INNER JOIN od1.taz AS zone1 ON ( ST_Intersects(zone1.geom ,t.startloc) = TRUE ) INNER JOIN od1.taz AS zone2 ON ( ST_Intersects(zone2.geom ,t.endloc) = TRUE ) ; --aggregate to get the number of trip from X to Y SELECT count() as number_of_trip_going_from, starting_zone, ending_zone FROM origin_destination_matrix GROUP BY starting_zone, ending_zone ; I would be supprised that you find something faster. Please note that if some of my hypothesis are false, you can still manage : if you drop 1. : the aggregate becomes SELECT count() as number_of_trip_going_from, l as starting , g as ending FROM origin_destination_matrix , LEAST(starting_zone,ending_zone) as l, GREATEST(starting_zone,ending_zone) as g GROUP BY l, g ; if you drop 2. : you need a way to choose which zone is the correct one when startloc or endloc is in several zone (example : the smallest area zone) the origin destination matrix computation becomes : SELECT DISTINCT ON ( t.trip_id) t.trip_id, zone1 AS starting_zone, zone2 as ending_zone FROM od1.trip_v1 as t INNER JOIN od1.taz AS zone1 ON ( ST_Intersects(zone1.geom ,t.startloc) = TRUE ) INNER JOIN od1.taz AS zone2 ON ( ST_Intersects(zone2.geom ,t.endloc) = TRUE ) ORDER BY ST_Area(zone1.geom) ASC, ST_Area(zone2.geom) ASC; If you drop 3. , it is easy to deal with. The dirty way is to add a polygon in "taz" that overlap all of you startloc/endloc, you can call it 'UNKNOWN', then use the "drop 2. hypothesis". The clean way is to look in origin destination matrix, and get missing trip_id (they are missing because they where no zone for startloc and/or endloc). Then do something for theim (I don"t know what you want to do with it) getting the missing trip_id can be done with an EXCEPT for instance : WITH missing_in_origin_destination AS ( --compute the missing trip, because startloc and/or endloc have no corresponding zone SELECT t.trip_id FROM od1.trip_v1 as t EXCEPT ALL SELECT odm.trip_id FROM origin_destination_matrix AS odm ) --your querry to do something with trip that are not in origin_destination matrix Cheers, Rémi-C 2015-02-11 9:37 GMT+01:00 Trang Nguyen <trang.ngu...@inrix.com>: > Hi Birgit, > > > > Thanks very much. > > I’m not familiar with the “with” clause but will look into that. There are > also indices on startts and z.uuid ( time range condition is ok, since > these are timestamps and not days). > > > > I am also finding that a cause of slowness was due to the OR condition: > ST_intersects(t.startloc, z.geom) > > or ST_intersects(t.endloc, z.geom) > > > > Splitting up the query into two unions, one using a where on > ST_intersects(t.startloc, > z.geom) and the second on ST_intersects(t.endloc, z.geom), but your > suggestion would get me closer to what I wanted in the final output. > > > > Best, > > Trang > > > > *From:* postgis-users-boun...@lists.osgeo.org [mailto: > postgis-users-boun...@lists.osgeo.org] *On Behalf Of *Birgit Laggner > *Sent:* Wednesday, February 11, 2015 12:20 AM > *To:* postgis-users@lists.osgeo.org > *Subject:* Re: [postgis-users] Question on optimizing slow geospatial > query > > > > Hi Trang, > > I think, it could help to create btree indices on "startts" and "uuid", > too, since you are using them in your where clause as a filter (a probably > unnecessary question regarding your date filter: I would expect the result > of "t.startts > '2015-01-16' and t.startts < '2015-01-17'" to be null > because there are no days between the 16th and the 17th of january - but > perhaps it was only an example...). And in general, my suggestion would be > to reduce the use of st_intersects to the necessary mimimum. > You could use the with clause for the filtration of your input data and > afterwards double join the two tables first on the startloc and second on > the endloc for the assignment of the origin and the destination zone. Then > group by origin and destination zones while counting your trips and you > should have your end result. > > Here is how I would imagine the query: > > with > t as (select trip_id, startloc, endloc from od1.trip_v1 where startts > between 'minimum start date' and 'maximum start date'), > z as (select zone from od1.taz where uuid in ('kansas_303', 'kansas_601', > 'kansas_603', etc)) > > select count(t.trip_id) as number_of_trips, orig.zone as orig_zone, > dest.zone as dest_zone from t left join z as orig on > st_intersects(t.startloc, z.geom) left join z as dest on > st_intersects(t.endloc, z.geom) group by orig.zone, dest.zone; > > > > However, I am not sure about how the gist indices work together with the > subselects of the with clause... > > Hope this helps, > > Birgit. > > > Am 11.02.2015 um 08:19 schrieb Trang Nguyen: > > Hi, > > > > I am a newbie to Postgres/PostGIS and have a long running query that I > would like to optimize. > > There are two tables (trip and zone) that I am joining in the query, one > which has “startloc” and “endloc” columns with type Geometry(Point) and > other which contains a Geometry(MultiPolygon). There are GIST indexes on > all above columns: > > > > CREATE TABLE od1.trip_v1 > > ( > > pkey bigint NOT NULL, > > trip_id character varying, > > startts timestamp without time zone, > > endts timestamp without time zone, > > startloc geometry(Point), > > endloc geometry(Point), > > … > > ) > > > > CREATE TABLE od1.taz > > ( > > uuid character varying NOT NULL, > > zone character varying NOT NULL, > > createdts timestamp without time zone NOT NULL, > > updatedts timestamp without time zone NOT NULL, > > geom geometry(MultiPolygon) NOT NULL, > > CONSTRAINT taz_pkey PRIMARY KEY (uuid) > > ) > > > I’m interested in building a matrix that, for a given set of input zones, > returns trips along with their start and end zones. Output looks like: > > > > 10 trips that start at Zone A, ends at Zone B > > 2 trips that start at Zone A, ends at Zone C > > 9 trips that start at Zone A, ends at other > > 13 trips that start at Zone C, ends at Zone D > > > > Since I am dealing with a large dataset (> 24 million records and > growing), I was planning on writing a query that returns the trips grouped > by each zone along with match condition (start, end or both) and doing > aggregation on the client layer. I’m not sure whether this is the best > approach but I expect that otherwise, I would end up having to write a very > complex query to handle that type of aggregation. > > > > Even so, the current query is very slow with very high cost: > > > > SELECT t.trip_id, > > case > > when ST_intersects(t.startloc, z.geom) and > ST_intersects(t.endloc, z.geom) then 'orig-dest' > > when ST_intersects(t.startloc, z.geom) then 'orig' > > when ST_intersects(t.endloc, z.geom) then 'dest' > > else 'none' > > end as match_cond, > > z.zone from od1.trip_v1 t, od1.taz z > > where t.startts > '2015-01-16' and t.startts < '2015-01-17' > > and z.uuid in ('kansas_303', 'kansas_601', 'kansas_603', etc) > > and ST_intersects(t.startloc, z.geom) > > or ST_intersects(t.endloc, z.geom) > > group by z.zone, t.trip_id, match_cond; > > > > Explain plan: > > "Group (cost=231446695055.73..245971533247.59 rows=14240037443 > width=3498)" > > " -> Sort (cost=231446695055.73..231482295149.34 rows=14240037443 > width=3498)" > > " Sort Key: z.zone, t.trip_id, (CASE WHEN ((t.startloc && z.geom) > AND _st_intersects(t.startloc, z.geom) AND (t.endloc && z.geom) AND > _st_intersects(t.endloc, z.geom)) THEN 'orig-dest'::text WHEN ((t.startloc > && z.geom) AND _st_intersects(t.startloc, (...)" > > " -> Nested Loop (cost=91.70..14401634128.24 rows=14240037443 > width=3498)" > > " -> Seq Scan on taz z (cost=0.00..739.19 rows=4619 > width=3406)" > > " -> Bitmap Heap Scan on trip_v1 t (cost=91.70..4151.26 > rows=453 width=107)" > > " Recheck Cond: ((startloc && z.geom) OR (endloc && > z.geom))" > > " Filter: (((startts > '2015-01-16 00:00:00'::timestamp > without time zone) AND (startts < '2015-01-17 00:00:00'::timestamp without > time zone) AND ((z.uuid)::text = ANY > ('{kansas_303,kansas_601,kansas_603,kansas_604,kansas_10,kansas_11,kan > (...)" > > " -> BitmapOr (cost=91.70..91.70 rows=2706 width=0)" > > " -> Bitmap Index Scan on idx_trip_v1_startloc > (cost=0.00..45.74 rows=1353 width=0)" > > " Index Cond: (startloc && z.geom)" > > " -> Bitmap Index Scan on idx_trip_v1_endloc > (cost=0.00..45.74 rows=1353 width=0)" > > " Index Cond: (endloc && z.geom)" > > > > Some help or suggestions on how to speed up the query would be much > appreciated. > > Also, I currently don’t have a specific map projection defined on the geom > columns so they are using the default of 0 in postgis. The points in both > trip and zone geometries are lon/lat. Is this an issue for the ST_intersect? > > > > Thanks, > Trang > > > > > > > _______________________________________________ > > postgis-users mailing list > > postgis-users@lists.osgeo.org > > http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users > > > > _______________________________________________ > postgis-users mailing list > postgis-users@lists.osgeo.org > http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users >
_______________________________________________ postgis-users mailing list postgis-users@lists.osgeo.org http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users