[PERFORM] Deleting certain duplicates

Shea,Dan [CIS] Mon, 12 Apr 2004 07:44:40 -0700

We have a large database which recently increased dramatically due to a
change in our insert program allowing all entries.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname
= 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08


     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |

The program is supposed to check to ensure that all fields but the
reception_time are unique using a select statement, and if so, insert it.
Due an error in a change, reception time was included in the select to check
for duplicates.  The reception_time is created by a program creating the dat
file to insert. 
Essentially letting all duplicate files to be inserted.

I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
from forecastelement group by
version,origin,timezone,region_id,wx_element,value,flag,units,valid_time,iss
ue_time,next_forecast);
It ran for 3 days creating what I assume is an index in pgsql_tmp of the
group by statement. 
The query ended up failing with "dateERROR:write failed".
Well the long weekend is over and we do not have the luxury of trying this
again. 
So I was thinking maybe of doing the deletion in chunks, perhaps based on
reception time.
Are there any suggestions for a better way to do this, or using multiple
queries to delete selectively a week at a time based on the reception_time.
I would say there are a lot of duplicate entries between mid march to the
first week of April.



---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

[PERFORM] Deleting certain duplicates

Reply via email to