On Wed, 2005-03-23 at 14:22 -0500, Keith Browne wrote:
> Simon Riggs wrote:
> > The EXPLAINs you've enclosed are for SELECTs, yet your bug report
> > describes INSERTs as being the things that are slow.
> > [You may find better performance from using COPY]
> We're starting with an empty database, creating four tables, and
> populating those tables with a total of 180,000-200,000 rows. Each
> table has a primary key, and several of the tables reference foreign
> keys in other tables. We've written a Python script, using psycopg,
> which executes all the queries to create the tables and insert the rows.
> The database is running on the same machine where the script runs.
> I've seen similar performance when issuing a COMMIT after each
> insertion, and also after batching insertions in blocks of 250 per
> COMMIT, so batching the commits is not helping much. I've looked at the
> possibility of using COPY, but in our production environment it will be
> prohibitive to build a flat file with all this data. I'd rather
> generate it on the fly, as we've been able to do with PostgreSQL 7.4.
> > Also, your tests have compared two systems, so it might be that the
> > hardware or configuration of one system is different from the other.
> When running with PostgreSQL 7.4 on a dual-CPU Athlon MP2400+ machine
> with a gigabyte of RAM, running Debian Linux version 18.104.22.168, we were
> able to insert all this data in 5-7 minutes. It's taken a while to
> install Postgres 8.0.1 on the same machine, but now I have, and it's
> taking 40-45 minutes to run the same insert script. This is similar to
> the performance we saw on another machine, a fast single-CPU AMD64 box
> running Gentoo.
> I don't think it's a hardware issue. I dug around a bit, and found
> suggestions that this sort of problem could be worked around by breaking
> the database connection and restarting it after the tables had been
> partially filled. I modified our script to break and re-establish the
> database connection when each table first has 4,000 records inserted,
> and the performance is greatly improved; it now takes only about 3.5
> minutes to insert 180,000+ rows.
> I've since modified this script to build and populate a fifth table with
> over 1.3 million rows. The fifth table has no primary key, but lists a
> foreign key into one of the first four tables. With the above
> modification (break and re-build the DB connection after 4,000 rows have
> been inserted), the whole database can be populated in about 15 minutes.
> I wouldn't have dared try to build a one-million-plus-row table until
> I found this speed-up.
> > If you could repeat the test on one single system, then this would
> > assist in the diagnosis of this bug report. Also, if you could describe
> > the workload that is giving you a problem more exactly, that would help.
> > Specifically, can you confirm that you have run ANALYZE on the tables,
> > and also give us some idea of numbers of rows in each table at the time
> > you first run your programs.
> Just to see if it would help, I tried modifying the script to run an
> ANALYZE against each table after 4,000 insertions, instead of breaking
> and re-establishing the DB connection. I still saw ~45-minute times to
> insert 180,000 rows. I then tried running ANALYZE against each table
> after *each* 4,000 rows inserted, and again, it took about 45 minutes to
> run the insert.
> Each table is empty when I first run the program. I am dropping and
> re-creating the database for each test run.
> > There is clearly a problem, but it is not yet clearly a bug. If it is a
> > bug, we're interested in solving it as much as you.
> I'd be happy to run further tests or provide more details, if they'll
> help. We now have a workaround which is allowing us to proceed with our
> project, but I'd like to know if there's another way to do this. While
> I understand that large or complex databases require careful tuning, I
> was surprised to see a six- or seven-fold increase in run times between
> PostgreSQL 7.4 and 8.0.1 on the same hardware, on an operation which
> seems fairly straightforward: populating an empty table.
> One other thing which puzzled me: as a test, I tried modifying our
> script to spit out raw SQL statements instead of connecting to the
> database and performing the inserts itself. Normally, our script
> populates two tables in one pass, and then populates the third and
> fourth tables in a second pass. I massaged the SQL by hand to group the
> inserts together by table, so that the first table would be entirely
> populated, then the second, etc. When I ran this SQL script by piping
> it straight into psql, it finished in about four minutes. This is
> comparable to the time it takes to run my modified script which breaks
> and re-establishes the connection to the database.
Your situation is covered in the manual with some sage advice
It doesn't go into great lengths about all the reasons why those
recommendations are good ones - but they are clear.
There isn't anything in there (yet) that says, "turn off Referential
Integrity too" and perhaps it should...
The tables you are loading all refer to one another with referential
constraints? Possibly a master-detail relationship, or two major
entities joined via an associative one. The plan is bad because your FKs
point to what are initially empty tables. The best thing to do would be
to add the RI constraints after the tables are loaded, rather than
adding them before.
Your program is issuing a Prepare statement, then followed by thousands
of Execute statements. This reduces much of the overhead of
optimization, since the plan is cached early in that sequence of
executes. The plan thus remains the same all the way through, though as
you observe, that isn't optimal. The initial plan saw an empty table,
though it didn't stay empty long. Breaking the connection and
reattaching forces the plan to be reevaluated; when this is performed
after the point at which a more optimal plan will be generated, your
further inserts use the better plan and work continues as fast as
psql doesn't suffer from this problem because it doesn't use Prepared
statements. That means you pay the cost of compiling each SQL statement
at execution time, though gain the benefit of an immediate plan change
at the optimal moment.
I think we should spawn a TODO item from this:
* Coerce FK lookups to always use an available index
but that in itself isn't a certain fix and might cause other
Best Regards, Simon Riggs
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster