Hi all:
For those who've been following the saga, I'm working on an application
that needs to load a data file consisting of approximately 29,000 to 35,000
records in it (and not short ones, either) into several tables. I'm using
MySQL as the database.
I've noticed a really horrible performance difference between INSERTing
rows into the table and UPDATEing rows with new data when they already exist in
the table. For example, when I first start with an empty table, the
application inserts around 29,600 records in something less than 6 minutes.
But, when I use the second file, updating that same table takes over 90 minutes.
Here's my question: I had assumed -- probably wrongly -- that it would be
far more expedient to only update rows where data had actually changed;
moreover, that I should only update the changed fields in the particular rows.
This involves a large number of if statements, i.e.,
if($old_row["field_a"] !== $new_row["field_66"] {
$update_query .= "field_a = '" .
mysql_real_escape_string($new_row["field_66"]) . "',";
}
Eventually, I wind up with a query similar to:
UPDATE table_01 SET field_a = 'New value here', updated=CURDATE() WHERE
primary_key=12345
I thought that, to keep the table updating to a minimum, this approach made
the most sense. However, seeing the two hugely different performance times has
made me question whether or not it would be faster to simply update every field
in the table and eliminate all of these test conditions.
And, before someone comments that indexes on the table can cause
performance hits, I DROP nearly all of the indexes at the start of the
processing, only keeping those indexes necessary to do the original INSERT or
the subsequent UPDATE, and then add all of the extra "steroid" indexes (you
know -- the performance-enhancing ones <g>) after all of the INSERTs and
UPDATEs have been finished.
So, long story short (oops -- too late!), what's the concensus among the
learned assembly here? Is it faster to just UPDATE the record if it already
exists regardless of the fact that maybe only one or two out of 75 or more
fields changed versus testing each one of those 75 fields to try and figure out
which ones actually changed and then only update those?
I look forward to reading all of your thoughts.
Sincerely,
Jon