Thanks Paul.

One more thought on this.

Don't forget to add any constraints and indexes back onto the new table AFTER it is populated; building them after the load is much faster than maintaining them during it. You might also need to run select populate_geometry_columns(); if you need the geometry_columns metadata updated after you do the rename.
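For example, the order of operations might look like this (the index name is made up for illustration; the primary key name matches the schema quoted below):

CREATE INDEX metadata_unique_id_idx
    ON portal.new_metadata ("data_object.unique_id");
ALTER TABLE portal.new_metadata
    ADD CONSTRAINT MetaDataKey PRIMARY KEY (idx);
-- refresh geometry_columns for any geometry columns:
SELECT populate_geometry_columns();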

Doing the rename in a transaction is atomic as far as other processes and queries are concerned (the ALTER TABLE takes a brief exclusive lock, so concurrent queries simply block for the instant of the swap and then see the new table). This method works well in production environments when you need to reload a table and don't want it to be offline while you are doing the update or reload.

-Steve

On 3/5/2012 1:40 PM, Paul Ramsey wrote:
This will be the fastest approach since it creates a new table
instead of rewriting an existing one. Good call, Stephen.

P.

On Mon, Mar 5, 2012 at 8:17 AM, Stephen Woodbridge
<wood...@swoodbridge.com>  wrote:
On 3/5/2012 11:12 AM, DrYSG wrote:

I have a large table (20M records) of mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial column that is
unique, but there are duplicates in another field.

I have an 8 core, 24GB memory computer with RAID disks.

This request has been running for 70 hours (is it safe to kill it?).


Yes, it is safe.


How can I make this run faster? This is a one time processing task, but it
is taking a long time.


CREATE TABLE portal.new_metadata AS
select distinct on ("data_object.unique_id") * FROM portal.metadata;
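One caveat: without an ORDER BY, DISTINCT ON keeps an arbitrary row from each duplicate group. If it matters which copy survives, a sketch that pins it down (keeping the lowest idx is just one choice):

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") *
FROM portal.metadata
ORDER BY "data_object.unique_id", idx;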

Or something of this ilk should be faster because it only needs to do a sort
on "data_object.unique_id" and then an insert. After you have verified the
results you can do:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
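Note that RENAME TO takes an unqualified new name; the table stays in the portal schema. Once you have verified the new table, the old copy can be dropped to reclaim the space:

DROP TABLE portal.metadata_old;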

Other people might have other ideas.
-Steve


DELETE FROM portal.metadata
WHERE idx NOT IN
(
     SELECT MIN(idx)
     FROM portal.metadata
     GROUP BY "data_object.unique_id"
);
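For what it's worth, a common alternative to the NOT IN form, which tends to scale poorly at this row count, is a self-join delete; a sketch, assuming idx is the unique bigserial key described above:

DELETE FROM portal.metadata a
USING portal.metadata b
WHERE a."data_object.unique_id" = b."data_object.unique_id"
  AND a.idx > b.idx;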

CREATE TABLE
     metadata
     (
         "data_object.unique_id" CHARACTER(64) NOT NULL,
         size_bytes BIGINT,
         object_date TIMESTAMP(6) WITHOUT TIME ZONE,
         object_type CHARACTER VARYING(25),
         classification CHARACTER VARYING(7),
         object_source CHARACTER VARYING(50),
         object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
         clevel INTEGER,
         fsctlh CHARACTER VARYING(50),
         oname CHARACTER VARYING(40),
         description CHARACTER VARYING(80),
         horizontal_datum CHARACTER VARYING(20),
         do_location CHARACTER VARYING(200),
         elevation_ft INTEGER,
         location USER-DEFINED, -- information_schema reports USER-DEFINED; presumably the PostGIS geometry column
         idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
         bbox CHARACTER VARYING(160),
         CONSTRAINT MetaDataKey PRIMARY KEY (idx)
     )


_______________________________________________
postgis-users mailing list
postgis-users@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users
