Re: [PERFORM] slow joining very large table to smaller ones

Dan Harris Thu, 14 Jul 2005 15:39:22 -0700


On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:



You might try giving it a little bit more freedom with:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107
-- AND k_r.incidentid = k_b.incidentid
;

I'm pretty sure that would give identical results, just let theplanner

have a little bit more freedom about how it does it.
Also the last line is commented out, because I think it is redundant.

Ok, I tried this one. My ssh keeps getting cut off by a routersomewhere between me and the server due to inactivity timeouts, soall I know is that both the select and explain analyze are takingover an hour to run. Here's the explain select for that one, sincethat's the best I can get.

explain select recordtext from eventactivity,k_r,k_b whereeventactivity.incidentid = k_r.incidentid andeventactivity.incidentid = k_b.incidentid and k_r.id = 94 and k_b.id= 107;

                                                  QUERY PLAN

--------------------------------------------------------------------------------------------------------------

Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")

-> Index Scan using eventactivity1 on eventactivity(cost=0.00..4186753.16 rows=46029271 width=49)

         ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
               Sort Key: (k_b.incidentid)::text

-> Index Scan using k_b_idx on k_b(cost=0.00..4668.31 rows=1821 width=68)

                     Index Cond: (id = 107)
   ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
         Sort Key: (k_r.incidentid)::text

-> Index Scan using k_r_idx on k_r (cost=0.00..4755.52rows=1879 width=68)

               Index Cond: (id = 94)
(13 rows)

You might also try:
EXPLAIN ANALYZE
SELECT recordtext
  FROM eventactivity JOIN k_r USING (incidentid)
  JOIN k_b USING (incidentid)
 WHERE k_r.id = 94
   AND k_b.id = 107
;


Similar results here.  The query is taking at least an hour to finish.

explain select recordtext from eventactivity join k_r using( incidentid ) join k_b using (incidentid ) where k_r.id = 94 andk_b.id = 107;


                                                 QUERY PLAN

--------------------------------------------------------------------------------------------------------------

Merge Join  (cost=9542.77..4672831.12 rows=148391132 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4726.61..4542825.87 rows=15930238 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")

-> Index Scan using eventactivity1 on eventactivity(cost=0.00..4184145.43 rows=46000104 width=49)

         ->  Sort  (cost=4726.61..4731.13 rows=1806 width=68)
               Sort Key: (k_b.incidentid)::text

-> Index Scan using k_b_idx on k_b(cost=0.00..4628.92 rows=1806 width=68)

                     Index Cond: (id = 107)
   ->  Sort  (cost=4816.16..4820.82 rows=1863 width=68)
         Sort Key: (k_r.incidentid)::text

-> Index Scan using k_r_idx on k_r (cost=0.00..4714.97rows=1863 width=68)

               Index Cond: (id = 94)
(13 rows)

Also, if possible give us the EXPLAIN ANALYZE so that we know if the
planner is making accurate estimates. (You might send an EXPLAIN while
waiting for the EXPLAIN ANALYZE to finish)

You can also try disabling merge joins, and see how that changesthings.


Are there any negative sideaffects of doing this?

Well, postgres is estimating around 500 rows each, is that way off?Try
just doing:
EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

And see if postgres estimates the number of rows properly.

I assume you have recently VACUUM ANALYZEd, which means you might need
to update the statistics target (ALTER TABLE k_b ALTER COLUMN
incidientid SET STATISTICS 100) default is IIRC 10, ranges from1-1000,
higher is more accurate, but makes ANALYZE slower.
Right now, it looks like pg is trying to sort the entireeventactivitytable for the merge join which is taking several minutes to do.Can Irephrase this so that it does the searching through k_r and k_bFIRSTand then go into eventactivity using the index on incidentid? Itseems
like that shouldn't be too hard to  make fast but my SQL query skills
are only average.
To me, it looks like it is doing an index scan (on k_b.id) through k_b
first, sorting the results by incidentid, then merge joining that with
eventactivity.

I'm guessing you actually want it to merge k_b and k_r to get extra
selectivity before joining against eventactivity.
I think my alternate forms would let postgres realize this. But ifnot,
you could try:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
 JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
    WHERE k_r.id = 94 AND k_b.id = 107)
USING (incidentid);


This one looks like the same plan as the others:

explain select recordtext from eventactivity join ( select incidentidfrom k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id =107 ) a using (incidentid );

                                                  QUERY PLAN

--------------------------------------------------------------------------------------------------------------

Merge Join  (cost=9793.33..4693149.15 rows=156544758 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4847.75..4557237.59 rows=16365843 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")

-> Index Scan using eventactivity1 on eventactivity(cost=0.00..4191691.79 rows=46084161 width=49)

         ->  Sort  (cost=4847.75..4852.38 rows=1852 width=68)
               Sort Key: (k_b.incidentid)::text

-> Index Scan using k_b_idx on k_b(cost=0.00..4747.24 rows=1852 width=68)

                     Index Cond: (id = 107)
   ->  Sort  (cost=4945.58..4950.36 rows=1913 width=68)
         Sort Key: (k_r.incidentid)::text

-> Index Scan using k_r_idx on k_r (cost=0.00..4841.30rows=1913 width=68)

               Index Cond: (id = 94)
(13 rows)

I don't know how selective your keys are, but one of these queries
should probably structure it better for the planner. It depends alot on
how selective your query is.

eventactivity currently has around 36 million rows in it. Thereshould only be maybe 200-300 incidentids at most that will be matchedwith the combination of k_b and k_r. That's why I was thinking Icould somehow get a list of just the incidentids that matched the id= 94 and id = 107 in k_b and k_r first. Then, I would only need tograb a few hundred out of 36 million rows from eventactivity.

If you have 100M rows, the above query looks like it expects k_r to
restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
should be a seq scan (> 10% of the rows = seq scan). But if you are
saying the selectivity is mis-estimated it could be different.

Yeah, if I understand you correctly, I think the previous paragraphshows this is a significant misestimate.




---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: [PERFORM] slow joining very large table to smaller ones

Reply via email to