> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:

                        | Before    | After
------------------------+-----------+-----------
COUNT(*)                | 6'104'075 | 6'104'075
COUNT(session)          | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 |   560'957

The following query completes within less than three hours on a machine
with a high load, versa many many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN
(
  SELECT r.session
  FROM requests r
  WHERE r.session IS NOT NULL
  GROUP BY r.session
  HAVING COUNT(*) > 1
);

Note that in order to correctly reverse an IN subquery, IS NOT NULL
needs to be added.

Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.

EXISTS            351'511
NOT IN            376'577
IN                386'780
LEFT JOIN      18'263'826
NOT EXISTS  7'241'815'330


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to