Hi all:

    For those who've been following the saga, I'm working on an application 
that needs to load a data file containing approximately 29,000 to 35,000 
records (and not short ones, either) into several tables.  I'm using MySQL 
as the database.

    I've noticed a really horrible performance difference between INSERTing 
rows into the table and UPDATEing rows with new data when they already exist in 
the table.  For example, when I first start with an empty table, the 
application inserts around 29,600 records in something less than 6 minutes.  
But, when I use the second file, updating that same table takes over 90 minutes.

    Here's my question: I had assumed -- probably wrongly -- that it would be 
far more expedient to only update rows where data had actually changed; 
moreover, that I should only update the changed fields in the particular rows.  
This involves a large number of if statements, i.e.,

    if ($old_row["field_a"] !== $new_row["field_66"]) {
        $update_query .= "field_a = '" .
            mysql_real_escape_string($new_row["field_66"]) . "',";
    }

    Eventually, I wind up with a query similar to:

        UPDATE table_01 SET field_a = 'New value here', updated=CURDATE() WHERE 
primary_key=12345

    I thought that, to keep writes to the table to a minimum, this approach made 
the most sense.  However, seeing two such hugely different run times has made me 
question whether it would be faster to simply update every field in the row and 
eliminate all of these test conditions.

    And, before someone comments that indexes on the table can cause 
performance hits, I DROP nearly all of the indexes at the start of the 
processing, only keeping those indexes necessary to do the original INSERT or 
the subsequent UPDATE, and then add all of the extra "steroid" indexes (you 
know -- the performance-enhancing ones <g>) after all of the INSERTs and 
UPDATEs have been finished.

    So, long story short (oops -- too late!), what's the consensus among the 
learned assembly here?  Is it faster to just UPDATE the record if it already 
exists, even though perhaps only one or two out of 75 or more fields changed, 
versus testing each of those 75 fields to try to figure out which ones actually 
changed and then updating only those?
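    A related alternative I've wondered about is MySQL's single-statement 
upsert, which folds the insert-vs-update decision into one query.  A sketch, 
again with placeholder column names:

```sql
-- Insert the row, or update it in place when the primary key already
-- exists; VALUES(col) refers to the value that would have been inserted.
INSERT INTO table_01 (primary_key, field_a, field_b, updated)
VALUES (12345, 'new value a', 'new value b', CURDATE())
ON DUPLICATE KEY UPDATE
    field_a = VALUES(field_a),
    field_b = VALUES(field_b),
    updated = CURDATE();
```

    Whether that's actually faster for my case is exactly the sort of thing 
I'm hoping to hear about.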

    I look forward to reading all of your thoughts.

    Sincerely,

        Jon
