Hello all: Question here about how best to optimize a large data load. Data load is ad hoc and so needs to be able to run during production hours.
We have a piece of software written in a desktop RAD environment (FileMaker Pro). Our users are using this tool to generate data that we need to get into postgresql. We're sending it through a middleware layer written in PHP. A single data load will generate from 10K-100K rows. Initially we did everything procedurally through PHP. 100K inserts, each one called through several layers of PHP abstraction. Bad idea. Current idea is to have PHP dump the data to a file, and suck the whole file in at once somehow. So far, so good: PHP can create the file in 6 minutes for 100K rows. That's actually acceptable. Now we want to use COPY to bring the data in. The target table has 6 indexes. Without indexes, naturally, we can load 80K rows in 2 seconds. With indexes, 46 seconds. (oddly, ONE index contributes 40+ seconds of that, yet they're all similar, single-column indexes. Anyway, that's another story). Normally, in a batch-like environment, I'd feel free to drop the indexes, load, reindex. That's perfectly fast. But the environment needs to be live, and those indexes are vital to a reporting engine that can be hit at any time. So we can't just drop them, even briefly. So I hit on the idea of doing the same thing, but inside a transaction. In theory that should affect no one else. To my delight, the transaction drop-copy-reindex ran in 7 seconds. I guess I'm just wondering how that's possible. I hate to sound like a superstitious goof, but it sounds to good to be true. At best, I figured to pay the whole penalty at the time of COMMIT -- that it would be fast up to that point, and then of course need to do exactly the same work as the transactionless version, as far as reconciling indexes or whatever the more accurately technical term is. So: is this too good to be true? Or is this actually a good way to do this? Any other observations on the whole process? Is there a better or different approach, or other things we should consider? Any and all thoughts are welcome. -- sgl ======================================================= Steve Lane Vice President The Moyer Group 14 North Peoria St Suite 2H Chicago, IL 60607 Voice: (312) 433-2421 Email: [EMAIL PROTECTED] Fax: (312) 850-3930 Web: http://www.moyergroup.com ======================================================= ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html