Removing a specific set of duplicates

Scott Haneda Tue, 20 Jul 2004 15:38:26 -0700

Mysql 4

I have a slightly tougher than usual condition in which I want to remove
duplicates.  I am not sure how to deal with this and am looking for your
suggestions.


The table that stores my data is called addresses, it contains among others,
a field called email_address.  Within this table emails are grouped on a
unique id, so for example, select email_address from addresses where
group_id = '5' AND user_id = '2' would show me all the addresses that I want
to work on.

The case is that users will always be adding more addresses to this group,
this is for a mailing list manager I am working on.  The trouble is that I
suspect users will upload a batch of addresses, then a few weeks later, they
will upload a new set, but they will contain the old set as well.  This
would pile up the duplicates in short order, and I don't want multiple
emails sent to the same person over and over again.

My first option is when they upload new addresses, to select and test for
the existence of that address.  If it exists, do not add it, otherwise I
will add it.  The size of some of these lists are large, in the 10's of
thousands.  I suspect this will add too much overhead to the import time.  I
can not make the column unique as there is good reason to have the email
address in the column more than once, since they are "groups" of email
addresses.

My thought is to allow the import of all the addresses, allow all the dupes,
then take out the dupes, I suspect this will be faster than a select for
every email address I want to import.  I think this involves selecting
distinct() into a temp table, deleting the addresses from the main table,
then selecting into the old table from the temp table and then destroying
the temp table.  If anyone can suggest a tricky way to do this with perhaps
a group by clause to simpy remove the dupes in one go, I would love to hear
it.

In regards to the temp table, is it up to me to maintain a unique temp table
name to not collide in the event 2 users were to hit the page at the same
time?


And now, the other "rub"....
Another field in the addresses table I used to track bounced emails, lets
call it 'bounces', which I increment by 1 on every bounce.  Here is the
other issue, I can not simply remove the dupes without first determining
which dupe to remove.  Basically, I want to remove all dupes where the
bounce count is 0 (default), but if there are more than 2 dupes, I want to
keep the one with the highest bounce count.

Any idea how I should be approaching this?
-- 
-------------------------------------------------------------
Scott Haneda                                Tel: 415.898.2602
http://www.newgeo.com                       Fax: 313.557.5052
[EMAIL PROTECTED]                            Novato, CA U.S.A.



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Removing a specific set of duplicates

Reply via email to