On 7/26/05, Dan Harris <[EMAIL PROTECTED]> wrote:
> I am working on a process that will be inserting tens of million rows
> and need this to be as quick as possible.
> The catch is that for each row I could potentially insert, I need to
> look and see if the relationship is already there  to prevent
> multiple entries.  Currently I am doing a SELECT before doing the
> INSERT, but I recognize the speed penalty in doing to operations.  I
> wonder if there is some way I can say "insert this record, only if it
> doesn't exist already".  To see if it exists, I would need to compare
> 3 fields instead of just enforcing a primary key.

I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewritting it, first into C# then into
Python I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of ram. ;-)

To limit the ram, I wrote a version of the python code that uses gdbm
files instead of Dictionaries. This requires a newer version of Python
(to allow a gdbm db to work just like a dictionary) but makes life
easier in case someone is using my software on a lower end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.

Matthew Nuzum

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Reply via email to