Sean Davis wrote:

Thanks. I thought about that a bit, and it seems likely to be expensive for a single query (though I should probably try it at some point). If I do end up reformatting results in response to user input (i.e., reusing the query), though, then your solution is likely to be very useful.

You might think so, however, consider:

$ cat temptable.sql
 SELECT to_id
   INTO TEMP TABLE tids
   FROM correlation
  WHERE from_id = 1234
  ORDER BY val DESC LIMIT 100;

 SELECT t1.to_id AS from_id, t2.to_id
   INTO TEMP TABLE from_to
   FROM tids t1, tids t2
  WHERE t1.to_id > t2.to_id;

 SELECT c.from_id, c.to_id, c.val
   FROM from_to
   JOIN correlation c USING (from_id, to_id)
  WHERE val > 0.5
  ORDER BY from_id, to_id;
$ cat subselect.sql
SELECT from_id, to_id, val
  FROM correlation
 WHERE from_id IN (SELECT to_id FROM correlation
                    WHERE from_id = 1234
                    ORDER BY val DESC LIMIT 100)
   AND to_id IN (SELECT to_id FROM correlation
                  WHERE from_id = 1234
                  ORDER BY val DESC LIMIT 100)
   AND from_id > to_id
   AND val > 0.5
 ORDER BY from_id, to_id;
$ psql -q -c "select count(1) from correlation" test
  count
----------
 64000000
(1 row)


$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m3.093s
user    0m0.000s
sys    0m0.010s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.238s
user    0m0.000s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.945s
user    0m0.010s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.949s
user    0m0.010s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.953s
user    0m0.010s
sys    0m0.000s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.237s
user    0m0.020s
sys    0m0.000s
$

Note that the subselect version takes roughly ten times as long as the temptable version (about 8x once the data is cached, 13x on the cold first run), and the advantage is not just a caching artifact: repeated runs of subselect.sql never get below about 1.9 seconds, while temptable.sql stays around 0.24 seconds.

I added the ORDER BY so that the results would come back in the same order - the matching md5sums show that the two query sets are producing identical output.

One thing to be aware of is the size of your returned data set - if it's fairly large, then the transfer time from the pgsql box back to your web server might overwhelm any "small" optimization in query time.
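As a rough sanity check (just a sketch, reusing the from_to temp table that temptable.sql leaves behind in the session, and assuming each row is two int4 ids plus one float8, i.e. about 16 bytes), you can count the rows the final query would return and estimate how much data would go over the wire:

 SELECT count(*)      AS rows_returned,
        count(*) * 16 AS approx_bytes   -- assumes ~16 bytes/row: 2 x int4 + 1 x float8
   FROM from_to
   JOIN correlation c USING (from_id, to_id)
  WHERE val > 0.5;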

Sean

On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:

Sometimes using a temp table is a better idea:

e.g.

-- start by creating a temp table 'tids' that hold the to_ids that
-- we are interested in.
 SELECT to_id
   INTO TEMP TABLE tids
   FROM correlation
  WHERE from_id = 1234
  ORDER BY val DESC LIMIT 100;

-- The following temp table makes use of the primary key on
-- the correlation table, and the stated goal from the original
-- question that:
--         from_id > to_id
--     and from_id in (tids.to_id)
--    and to_id in (tids.to_id)

 SELECT t1.to_id AS from_id, t2.to_id
   INTO TEMP TABLE from_to
   FROM tids t1, tids t2
  WHERE t1.to_id > t2.to_id;

-- Now we can use the from_to table as an index into the correlation
-- table.

 SELECT c.from_id, c.to_id, c.val
   FROM from_to
   JOIN correlation c USING(from_id, to_id)
  WHERE val > 0.5;


The explain analyze for the final select works out to:
 Nested Loop  (cost=0.00..50692.00 rows=8488 width=16) (actual time=0.171..150.095 rows=2427 loops=1)
   ->  Seq Scan on from_to  (cost=0.00..79.38 rows=5238 width=8) (actual time=0.006..7.660 rows=4950 loops=1)
   ->  Index Scan using correlation_pkey on correlation c  (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0 loops=4950)
         Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id = c.to_id))
         Filter: (val > 0.5::double precision)
 Total runtime: 152.261 ms
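For reference, the plan above assumes a table shaped roughly like the following - correlation_pkey being a composite primary key on (from_id, to_id), and val being double precision (the filter casts 0.5 to double precision). The integer column types here are my guess, not taken from the original schema:

 -- Hypothetical definition consistent with the plan above; column types are assumed.
 CREATE TABLE correlation (
     from_id integer          NOT NULL,
     to_id   integer          NOT NULL,
     val     double precision NOT NULL,
     PRIMARY KEY (from_id, to_id)   -- this is what provides correlation_pkey
 );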



Richard Huxton wrote:

Sean Davis wrote:

I'll answer my own question, if only for my own records. The following query is about 5-6 times faster than the original. Of course, if anyone else has other ideas, I'd be happy to hear them.

Sean

EXPLAIN ANALYZE
SELECT from_id, to_id, val
  FROM exprsdb.correlation
 WHERE from_id IN (SELECT to_id FROM exprsdb.correlation
                    WHERE from_id = 2424
                    ORDER BY val DESC LIMIT 100)
   AND to_id IN (SELECT to_id FROM exprsdb.correlation
                  WHERE from_id = 2424
                  ORDER BY val DESC LIMIT 100)
   AND val > 0.6
   AND to_id < from_id;



It might not be any faster, but you can do this as a self-join with subqueries:


SELECT c1.from_id, c1.to_id, c1.val
FROM
  correlation c1,
  (
    SELECT to_id FROM correlation WHERE from_id=2424
    ORDER BY val DESC LIMIT 100
  ) AS c2,
  (
    SELECT to_id FROM correlation WHERE from_id=2424
    ORDER BY val DESC LIMIT 100
  ) AS c3
WHERE
  c1.from_id = c2.to_id
  AND c1.to_id = c3.to_id
  AND c1.val > 0.5
  AND c1.to_id < c1.from_id
;

I think PG should be smart enough nowadays to figure out these two queries are basically the same.
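One way to check that (just a sketch, shown here for the self-join form; prefix the IN-subselect version the same way and compare) is to look at what the planner actually does with each:

 -- If the planner really treats the two forms as equivalent, the plans and
 -- timings from EXPLAIN ANALYZE should come out essentially the same.
 EXPLAIN ANALYZE
 SELECT c1.from_id, c1.to_id, c1.val
   FROM correlation c1,
        (SELECT to_id FROM correlation WHERE from_id = 2424
          ORDER BY val DESC LIMIT 100) AS c2,
        (SELECT to_id FROM correlation WHERE from_id = 2424
          ORDER BY val DESC LIMIT 100) AS c3
  WHERE c1.from_id = c2.to_id
    AND c1.to_id = c3.to_id
    AND c1.val > 0.5
    AND c1.to_id < c1.from_id;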



-- Edmund Bacon <[EMAIL PROTECTED]>





-- Edmund Bacon <[EMAIL PROTECTED]>


