[SQL] De-duplicating rows

Christophe Thu, 16 Jul 2009 20:15:02 -0700

The Subject: is somewhat imprecise, but here's what I'm trying to do. For some reason, my brain is locking up over it.

I'm moving a 7.2 (yes) database to 8.4. In the table in question, the structure is along the lines of:


        serial_number   SERIAL, PRIMARY KEY
        email           TEXT
        create_date     TIMESTAMP
        attr1           type
        attr2           type
        attr3           type
        ...

(The point of the "attr" fields is that there are many more columns for each row.)

The new structure removes the "serial_number" field, and uses "email" as the primary key, but is otherwise unchanged:


        email           TEXT, PRIMARY KEY
        create_date     TIMESTAMP
        attr1           type
        attr2           type
        attr3           type
        ...

Now, since this database has been in production since the 7.2 days, cruft has crept in: in particular, there are duplicate email addresses, some with mismatched attributes. The policy decision by the client is that the correct row is the one with the earliest timestamp. (The timestamps are widely distributed; it's not the case that there is a single timestamp above which all the duplicates live.) Thus, ideally, I want to select exactly one row per "email", picking the row with the earliest timestamp in the case that there is more than one row with that email.

Any suggestions on how to write such a SELECT? Of course, I could do this with an application against the db, but a single SELECT would be great if possible.
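
For concreteness, here is a rough sketch of the kind of query being asked for. It is only one possible approach, assuming PostgreSQL's DISTINCT ON clause; the table name "old_table" is hypothetical, and only the columns named above are listed, so the remaining attribute columns would need to be added to the select list.

```sql
-- Sketch only: "old_table" is a hypothetical name for the existing table.
-- DISTINCT ON (email) keeps the first row of each email group, and the
-- ORDER BY decides which row is first: the earliest create_date.
-- serial_number is included only as a tie-breaker for identical timestamps.
SELECT DISTINCT ON (email)
       email,
       create_date,
       attr1,
       attr2,
       attr3
FROM   old_table
ORDER  BY email, create_date, serial_number;
```

If this shape fits, the same SELECT could presumably feed an INSERT INTO ... SELECT against the new table, since serial_number appears only in the ORDER BY and not in the output.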


TIA!

--
Sent via pgsql-sql mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
