Paul,

These opinions reflect my own experience; your mileage may vary.

I think the "fastest" way will depend on your source data. For most data
dumps I use LOAD DATA INFILE, as it reads both delimited and fixed-width
text. Applied to a table with NO indexes on it, it gives me the best
results translating data files into a MySQL database.
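
Something like this, assuming a tab-delimited dump file (the file name and
column names are made up for illustration):

LOAD DATA INFILE '/tmp/source_dump.txt'
INTO TABLE import_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(ssn, last_name, first_name, zipcode);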

Now, how to detect dupes and not break your primary keys... There are
several ways of checking the integrity of your source data.

You ask: "If inserted records have a duplicate SSN, I don't want to insert
those but put them in a duplicate table or flag them" and "Insert the row
and ignore duplicates using insert into master (...)"

Those are similar in what you are trying to do. Let's look at flagging or
re-logging duplicate source rows first.

It would be very useful to apply some "artificial" uniqueness constraint to
your random source data unless you are absolutely certain that you can
identify a primary key. I typically do this (when I need to) by importing
the data into a table with an auto_incrementing ID field. That way I can
easily identify each and every row in my source data. Using this field, it
is rather simple to collect which rows duplicate themselves in the columns
you need to have unique.
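
For example, the staging table might look like this (the data columns are
placeholders for whatever is in your source files; the id column is the
part that matters for this trick):

CREATE TABLE import_table (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ssn CHAR(9),
  last_name VARCHAR(50),
  first_name VARCHAR(50),
  zipcode CHAR(5),
  PRIMARY KEY (id)
);

With every row uniquely tagged, this query finds the duplicated
combinations: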

SELECT <column list>
FROM import_table
GROUP BY <column list>
HAVING COUNT(1) > 1

<column list> represents the list of columns in which you do not want data
duplicated.
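
For instance, if SSN is the one column that must stay unique, that becomes:

SELECT ssn
FROM import_table
GROUP BY ssn
HAVING COUNT(1) > 1;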

To copy those to another table takes two steps: CREATE a list of your
duplicated primary key columns, then INSERT the duplicated rows into the
other table.

CREATE TEMPORARY TABLE tmpDupes
SELECT <column list>
FROM import_table
GROUP BY <column list>
HAVING COUNT(1) > 1

INSERT INTO duplicated_table (<destination column list>)
SELECT <source column list>
FROM import_table
INNER JOIN tmpDupes
      ON ...
      AND ...
      AND ... <one comparison per column in tmpDupes against the same
column in import_table>


DROP TABLE tmpDupes
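
Using the same made-up columns as above (with SSN as the only uniqueness
column, so the join needs just one comparison):

CREATE TEMPORARY TABLE tmpDupes
SELECT ssn
FROM import_table
GROUP BY ssn
HAVING COUNT(1) > 1;

INSERT INTO duplicated_table (ssn, last_name, first_name, zipcode)
SELECT i.ssn, i.last_name, i.first_name, i.zipcode
FROM import_table i
INNER JOIN tmpDupes d
      ON i.ssn = d.ssn;

DROP TABLE tmpDupes;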

To merge your imported data with your existing data in such a way as to NOT
break your existing primary keys, you can use the INSERT statement with the
IGNORE keyword.
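
Again with assumed column names, the merge would look like this; IGNORE
silently skips any row that collides with the unique index on master:

INSERT IGNORE INTO master (ssn, last_name, first_name, zipcode)
SELECT ssn, last_name, first_name, zipcode
FROM import_table;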

I will sometimes add keys to my import table (after it is loaded) to speed
up this processing, then drop them again before the next batch import.
To flag your import data, you will have to ALTER TABLE ... ADD a new column
to it, then run an UPDATE query. I suggest you create a temp table for your
duplicate keys (like I did above) so that you can JOIN that list back into
your import data to quickly update the new flag field for only those rows
that are duplicated.
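
A sketch of that flagging pass, assuming the tmpDupes list from above is
still around (or is rebuilt the same way) and SSN is the key column:

ALTER TABLE import_table ADD INDEX idx_ssn (ssn);  -- optional, speeds up the join
ALTER TABLE import_table ADD COLUMN is_dupe TINYINT NOT NULL DEFAULT 0;

UPDATE import_table i
INNER JOIN tmpDupes d ON i.ssn = d.ssn
SET i.is_dupe = 1;

ALTER TABLE import_table DROP INDEX idx_ssn;  -- drop before the next batch load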

I am sorry I could not be more specific but you did ask a VERY generic
question.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine




                                                                                       
                                 
                      "Paul Chu"                                                       
                                 
                      <[EMAIL PROTECTED]        To:       <[EMAIL PROTECTED]>          
                              
                      net>                     cc:                                     
                                 
                                               Fax to:                                 
                                 
                      06/18/2004 10:02         Subject:  RE: Fastest way to load a 
master table removing duplicates -   
                      PM                        Not Answered                           
                                 
                                                                                       
                                 
                                                                                       
                                 




Appreciate any help at all

Thanks, Paul



-----Original Message-----
From: Paul Chu [mailto:[EMAIL PROTECTED]
Sent: Friday, June 18, 2004 10:16 AM
To: [EMAIL PROTECTED]
Subject: Fastest way to load a master table removing duplicates

Hi all,



I want to load a master table with millions of rows from other sources,
with a unique index on e.g. SSN (social security number).

If inserted records have a duplicate SSN, I don't want to insert those
but put them in a duplicate table or flag them.


The primary key will be an auto-increment field. There will be other
indexes, such as zipcode.



What is the fastest way to load these rows and remove duplicates ?

Assume I load the rows to be inserted into another table.



1.  Check if the SSN already exists before inserting the row?



2.  Insert the row and ignore duplicates using

insert into master (...)

      select ... from loaddata



I have lots of files with data that can be saved to load tables and
then inserted into the master table.





Regards, Paul







--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]






