On Thu, Jul 16, 2009 at 9:07 PM, Christophe<x...@thebuild.com> wrote: > The Subject: is somewhat imprecise, but here's what I'm trying to do. For > some reason, my brain is locking up over it. > > I'm moving a 7.2 (yes) database to 8.4. In the table in question, the > structure is along the lines of: > > serial_number SERIAL, PRIMARY KEY > email TEXT > create_date TIMESTAMP > attr1 type > attr2 type > attr3 type > ... > > (The point of the "attr" fields is that there are many more columns for each > row.) > > The new structure removes the "serial_number" field, and uses "email" as the > primary key, but is otherwise unchanged: > > email TEXT, PRIMARY KEY > create_date TIMESTAMP > attr1 type > attr2 type > attr3 type > ... > > Now, since this database has been production since 7.2 days, cruft has crept > in: in particular, there are duplicate email addresses, some with mismatched > attributes. The policy decision by the client is that the correct row is > the one with the earliest timestamp. (The timestamps are widely > distributed; it's not the case that there is a single timestamp above which > all the duplicates live.) Thus, ideally, I want to select exactly one row > per "email", picking the row with the earliest timestamp in the case that > there is more than one row with that email. > > Any suggestions on how to write such a SELECT? Of course, I could do this > with an application against the db, but a single SELECT would be great if > possible.
OK, assuming we can keep the serial number during the conversion, we could use something like this: select distinct a.serial_number from table a join table b on (a.email=b.email and a.serial_number>b.serial_number) Now assuming that the serial numbers and the timestamps are in order together, that'll give us all the serial numbers for all the matching email addresses EXCEPT the first one. If the serial numbers are not in order with the timestamps, then create a sequence, and update them in order, then the query will work. Once you've confirmed by hand that the first hundred or so serial_numbers you're getting back ARE in fact all n+1 for the same email address, use the select in a subselect to delete: delete from table x where serial_number in (select distinct....) -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql