Re: [PERFORM] Why hash join instead of nested loop?

Havasvölgyi Ottó Fri, 05 Aug 2005 15:16:17 -0700

Rhett,

Please post the explain analyze for both queries. From that we can see thepredicted and the actual costs of them.


Regards,
Otto

----- Original Message -----From: "Rhett Garber" <[EMAIL PROTECTED]>

To: <[email protected]>
Sent: Friday, August 05, 2005 8:35 PM
Subject: [PERFORM] Why hash join instead of nested loop?


I've got similiar queries that I think should be evaluated (as
displayed through 'explain') the same, but don't.
Hopefully this is the rigth place to send such a question and one of
you can help explain this to me.

The Tables:
  Connection  - 1.2 million entries, about 60 megs, 3 integer fields
that link two tables together (i.e. an identifier and two foreign
keys).
                        has an index on the identifier and either of
the foreign keys.
  rtmessagestate - very small, 5 entries
  rtmessage - pretty big, 80,000 entries


The Queries:

select rtmessagestate.* from rtmessagestate, connection where
connection_registry_id = 40105 and obj1 = 73582 and obj2 =
rtmessagestate.id;
     returns 1 in 13.7 ms

select rtmessage.id, subject from rtmessage, connection where
connection_registry_id = 40003 and obj1 = 4666 and obj2 =
rtmessage.id;
     returns 12 in 2 ms

Some more possibly important details:
  entries in Connection with connection_registry_id = 40105: 30,000
    entries with this id and obj1 = 73582: 1
  entries in Connection with connection_registry_id = 40003: 6,000
    entries with this id and obj1 = 4666: 20

but as I said before, there is an btree index on (connection_registry_id,obj1)


Explain:
The first query, breaks down as:
Hash Join  (cost=5.96..7.04 rows=1 width=14)
      Hash Cond: ("outer".id = "inner".obj2)
         ->  Seq Scan on rtmessagestate  (cost=0.00..1.05 rows=5 width=14)
      ->  Hash  (cost=5.96..5.96 rows=1 width=4)
            ->  Index Scan using connection_regid_obj1_index on
connection  (cost=0.00..5.96 rows=1 width=4)
                  Index Cond: ((connection_registry_id = 40105) AND
(obj1 = 73582))(6 rows)


While the second query is:
Nested Loop  (cost=0.00..11.62 rows=2 width=38)
      ->  Index Scan using connection_regid_obj1_index on connection
(cost=0.00..5.96 rows=1 width=4)

Index Cond: ((connection_registry_id = 40003) AND (obj1 =4666))

      ->  Index Scan using rtmessage_pkey on rtmessage
(cost=0.00..5.65 rows=1 width=38)
            Index Cond: ("outer".obj2 = rtmessage.id)
            (5 rows)

Actually running these queries shows that the second one (nested loop)
is much faster than the hash join, presumably because of hash startup
costs. Any ideas how I can make them both use the nested loop. I
assume that this would be the fastest for both.

Oddly enough, running the 1st query (rtmessagestate) as two queries or
with a sub query is way faster than doing the join.

And yes, I realize this schema may not be the most efficient for these
examples, but it seems to be the most flexible. I'm working on some
schema improvements also
but if I could understand why this is slow that woudl probably help also.

Thanks for you help,

Rhett

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq




---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

Re: [PERFORM] Why hash join instead of nested loop?

Reply via email to