Thanks Paul.

One more thought on this.

Don't forget to add any constraints and indexes back onto the new table AFTER it is populated; building them after the load is much faster than maintaining them during it. You might also need to run select populate_geometry_columns(); if you need the geometry_columns metadata updated after you do the rename.
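For example, the order of operations might look like this (the index name is made up for illustration; the primary key name matches the schema quoted below):

CREATE INDEX metadata_unique_id_idx
    ON portal.new_metadata ("data_object.unique_id");
ALTER TABLE portal.new_metadata
    ADD CONSTRAINT MetaDataKey PRIMARY KEY (idx);
-- refresh geometry_columns for any geometry columns:
SELECT populate_geometry_columns();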

Doing the rename in a transaction is atomic as far as other processes and queries are concerned (the ALTER TABLE takes a brief exclusive lock, so concurrent queries simply block for the instant of the swap and then see the new table). This method works well in production environments when you need to reload a table and don't want it to be offline while you are doing the update or reload.

-Steve

On 3/5/2012 1:40 PM, Paul Ramsey wrote:
This will be the fastest approach since it creates a new table
instead of rewriting an existing one. Good call, Stephen.

P.

On Mon, Mar 5, 2012 at 8:17 AM, Stephen Woodbridge
<wood...@swoodbridge.com>  wrote:
On 3/5/2012 11:12 AM, DrYSG wrote:

I have a large table (20M records) of mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial column that is
unique, but there are duplicates in another field.

I have an 8 core, 24GB memory computer with RAID disks.

This request has been running for 70 hours (is it safe to kill it?).


Yes, it is safe.


How can I make this run faster? This is a one time processing task, but it
is taking a long time.


CREATE TABLE portal.new_metadata AS
select distinct on ("data_object.unique_id") * FROM portal.metadata;
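One caveat: without an ORDER BY, DISTINCT ON keeps an arbitrary row from each duplicate group. If it matters which copy survives, a sketch that pins it down (keeping the lowest idx is just one choice):

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") *
FROM portal.metadata
ORDER BY "data_object.unique_id", idx;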

Or something of this ilk should be faster because it only needs to do a sort
on "data_object.unique_id" and then an insert. After you have verified the
results you can do:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
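Note that RENAME TO takes an unqualified new name; the table stays in the portal schema. Once you have verified the new table, the old copy can be dropped to reclaim the space:

DROP TABLE portal.metadata_old;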

Other people might have other ideas.
-Steve


DELETE FROM portal.metadata
WHERE idx NOT IN
(
     SELECT MIN(idx)
     FROM portal.metadata
     GROUP BY "data_object.unique_id"
);
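For what it's worth, a common alternative to the NOT IN form, which tends to scale poorly at this row count, is a self-join delete; a sketch, assuming idx is the unique bigserial key described above:

DELETE FROM portal.metadata a
USING portal.metadata b
WHERE a."data_object.unique_id" = b."data_object.unique_id"
  AND a.idx > b.idx;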

CREATE TABLE
     metadata
     (
         "data_object.unique_id" CHARACTER(64) NOT NULL,
         size_bytes BIGINT,
         object_date TIMESTAMP(6) WITHOUT TIME ZONE,
         object_type CHARACTER VARYING(25),
         classification CHARACTER VARYING(7),
         object_source CHARACTER VARYING(50),
         object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
         clevel INTEGER,
         fsctlh CHARACTER VARYING(50),
         oname CHARACTER VARYING(40),
         description CHARACTER VARYING(80),
         horizontal_datum CHARACTER VARYING(20),
         do_location CHARACTER VARYING(200),
         elevation_ft INTEGER,
         location USER-DEFINED, -- information_schema reports USER-DEFINED; presumably the PostGIS geometry column
         idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
         bbox CHARACTER VARYING(160),
         CONSTRAINT MetaDataKey PRIMARY KEY (idx)
     )


_______________________________________________
postgis-users mailing list
postgis-users@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users
