Thanks Mallah,
I will keep this example in case I need it again sometime in the future.
Unfortunately, I do not have enough free space at the moment to create a
temp table. 


-----Original Message-----
From: Rajesh Kumar Mallah [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 13, 2004 10:27 AM
To: Shea,Dan [CIS]
Cc: Postgres Performance
Subject: Re: [PERFORM] Deleting certain duplicates

Shea,Dan [CIS] wrote:

>The index is
>    "forecastelement_rwv_idx" btree (region_id, wx_element, valid_time)
>-----Original Message-----
>From: Shea,Dan [CIS] [mailto:[EMAIL PROTECTED]
>Sent: Monday, April 12, 2004 10:39 AM
>To: Postgres Performance
>Subject: [PERFORM] Deleting certain duplicates
>We have a large database which recently increased dramatically due to a
>change in our insert program allowing all entries.
>PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where
>= 'forecastelement';
>     relname     | relfilenode |  reltuples
> forecastelement |   361747866 | 4.70567e+08
>     Column     |            Type             | Modifiers
> version        | character varying(99)       |
> origin         | character varying(10)       |
> timezone       | character varying(99)       |
> region_id      | character varying(20)       |
> wx_element     | character varying(99)       |
> value          | character varying(99)       |
> flag           | character(3)                |
> units          | character varying(99)       |
> valid_time     | timestamp without time zone |
> issue_time     | timestamp without time zone |
> next_forecast  | timestamp without time zone |
> reception_time | timestamp without time zone |
>The program is supposed to check to ensure that all fields but the
>reception_time are unique using a select statement, and if so, insert it.
>Due an error in a change, reception time was included in the select to
>for duplicates.  The reception_time is created by a program creating the
>file to insert. 
>Essentially letting all duplicate files to be inserted.
>I tried the delete query below.
>PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
>from forecastelement group by
>It ran for 3 days creating what I assume is an index in pgsql_tmp of the
>group by statement. 
>The query ended up failing with "dateERROR:write failed".
>Well the long weekend is over and we do not have the luxury of trying this
>So I was thinking maybe of doing the deletion in chunks, perhaps based on
>reception time.

its more of an sql question though.

to deduplicate on basis of


You could do this.

begin work;
create temp_table as select distinct on 
issue_time,next_forecast) * from forecastelement ;
truncate table forecastelement ;
drop index <index on forecastelement >  ;
insert into forecastelement  select * from temp_table ;
create indexes
Analyze forecastelement ;

note that distinct on will keep only one row out of all rows having  
distinct values
of the specified columns. kindly go thru the distinct on manual before 
the queries.


>Are there any suggestions for a better way to do this, or using multiple
>queries to delete selectively a week at a time based on the reception_time.
>I would say there are a lot of duplicate entries between mid march to the
>first week of April.
>---------------------------(end of broadcast)---------------------------
>TIP 8: explain analyze is your friend
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to [EMAIL PROTECTED] so that your
>      message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?


Reply via email to