At 05:06 PM 9/1/2005, Matthew Sackman wrote:
On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
> > "address_city_index" btree (city)
> > "address_county_index" btree (county)
> > "address_locality_1_index" btree (locality_1)
> > "address_locality_2_index" btree (locality_2)
> > "address_pc_bottom_index" btree (postcode_bottom)
> > "address_pc_middle_index" btree (postcode_middle)
> > "address_pc_top_index" btree (postcode_top)
> > "address_pc_top_middle_bottom_index" btree (postcode_top,
> > postcode_middle, postcode_bottom)
> > "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
> > "address_postcode_index" btree (postcode)
> > "address_property_type_index" btree (property_type)
> > "address_street_index" btree (street)
> > "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
> Wow, that's quite a lot of indexes... but your problem isn't reported as
> being in insert/update/delete.
Hah, well now that you mention it. Basically, 100,000 rows come in via a
bulk import every month, and the only way I can get it to complete in any
sane time frame at all is to drop the indexes, do the import and then
recreate the indexes. But that's something that I'm OK with -
FTR, this "drop the indexes, do <foo>, recreate the indexes" approach is
Industry Standard Practice for bulk inserts/updates/deletes, regardless
of the DB product used.
- the imports don't have to be that fast, and whilst important, they are
not *the* critical path. Selection from the database is, hence the
indexes.
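For readers following along, the drop/import/recreate pattern described
above can be sketched roughly like this (the two index definitions are
taken from the schema shown earlier; the file path and the elided
remaining indexes are placeholders, not from the original posts):

```sql
-- Sketch of the monthly bulk-import pattern: drop indexes, load, recreate.
-- The input path is hypothetical; only two of the indexes are shown.
BEGIN;
DROP INDEX address_city_index;
DROP INDEX address_street_index;
-- ... drop the remaining indexes on address the same way ...

COPY address FROM '/path/to/monthly_batch.txt' WITH DELIMITER ',';

CREATE INDEX address_city_index ON address (city);
CREATE INDEX address_street_index ON address (street);
-- ... recreate the rest ...
COMMIT;

-- Refresh planner statistics after the load:
ANALYZE address;
```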
A DB _without_ indexes that fits into RAM during ordinary operation
may actually be faster than a DB _with_ indexes that does
not. Fitting the entire DB into RAM during ordinary operation if at
all possible should be the first priority with a small data mine-like
application such as you've described.
Also normalization is _not_ always a good thing for data mining like
apps. Having most or everything you need in one place in a compact
and regular format is usually more effective for data mines than "Nth
Order Normal Form" optimization to the degree usually found in
textbooks using OLTP-like examples.
Indexes are a complication used as a performance enhancing technique
because without them the DB is not performing well enough. IME, it's
usually better to get as much performance as one can from other
aspects of design and _then_ start adding complications. Including
indexes. Even if you fit the whole DB in RAM, you are very likely to
need some indexes; but profile your performance first and then add
indexes as needed rather than just adding them willy nilly early in
the design process.
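One way to do that profiling in PostgreSQL is EXPLAIN ANALYZE, which
executes the statement and reports the chosen plan with actual row
counts and per-node timings. Using one of the queries quoted later in
this thread:

```sql
-- Runs the query and shows whether the planner chose a sequential
-- scan or an index scan, with actual times per plan node:
EXPLAIN ANALYZE
SELECT locality_2 FROM address WHERE locality_2 = 'Manchester';
```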
You said you had 1GB of RAM on the machine now. That clearly is
inadequate to your desired performance given what you said about the
DB. Crank that box to 4GB and tighten up your data structures. Then
see where you are.
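To see whether the data could plausibly fit in RAM, per-relation
on-disk sizes can be estimated from the system catalogs. A sketch
(relpages counts 8 kB blocks and is only refreshed by VACUUM/ANALYZE,
so treat the figures as approximate):

```sql
-- Approximate on-disk size of the ten largest relations, in MB:
SELECT relname, relpages * 8 / 1024 AS approx_mb
FROM pg_class
ORDER BY relpages DESC
LIMIT 10;
```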
> > This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
> > SATA harddrive.
> 8.0 or 8.1 might help you some -- better (and more!) disks will help
> a _lot_.
Ok, I did try 8.0 when I started this and found that server-side bind
parameters (both via DBD::Pg (with pg_prepare_server => 1) and via JDBC
(various versions I tried)) failed - the parameters were clearly not
being substituted. This was PostgreSQL 8.0 from Debian unstable. That
was a couple of weeks ago and I've not been back to check whether it's
been fixed. Anyway, because of these problems I dropped back to 7.4.
Since I assume you are not going to run anything with the string
"unstable" in its name in production (?!), why not try a decent
production-ready distro like SUSE 9.x and see how pg 8.0.3 runs on an
OS more representative of what you are likely (or at least what is
safe...) to run in production?
> > Queries such as:
> > select locality_2 from address where locality_2 = 'Manchester';
> > are taking 14 seconds to complete, and this is only 2 years worth of
> > data - we will have up to 15 years (so over 15 million rows).
> As Tom pointed out; you're effectively doing random searches here, and using
> CLUSTER might help. Normalizing your data to get smaller rows (and avoid
> possibly costly string comparisons if your strcoll() is slow) will probably
> also help.
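For reference, CLUSTER physically rewrites the table in the order of
one index, turning the random heap fetches mentioned above into largely
sequential reads for queries on that column. A sketch using the
7.4-era syntax and one of the indexes listed earlier (newer releases
spell it CLUSTER tablename USING indexname):

```sql
-- Rewrite address in locality_2 order; this takes an exclusive lock
-- and needs re-running after large imports. Then refresh statistics:
CLUSTER address_locality_2_index ON address;
ANALYZE address;
```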
Ok, so you're saying that joining the address table into an address_city
table (the obvious normalization) will help here?
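A minimal sketch of that normalization, assuming a hypothetical city
lookup table (the table and column names here are illustrative, not
from the original schema):

```sql
-- Hypothetical: replace the repeated city strings with a small
-- integer key, shrinking address rows and avoiding repeated string
-- comparisons in the common filters:
CREATE TABLE city (
    city_id serial PRIMARY KEY,
    name    text NOT NULL UNIQUE
);

-- address.city would become address.city_id integer
-- REFERENCES city (city_id), and lookups become a join:
SELECT a.street
FROM address a
JOIN city c ON c.city_id = a.city_id
WHERE c.name = 'Manchester';
```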
The locale settings in postgresql.conf all have en_GB, and a \l shows an
encoding of LATIN1, so I don't think I've set anything to UTF8 or similar.
> > I need to get to the stage where I can run queries such as:
> > select street, locality_1, locality_2, city from address
> > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> > or locality_1 = 'Nottingham')
> > and upper(substring(street from 1 for 1)) = 'A'
> > group by street, locality_1, locality_2, city
> > order by street
> > limit 20 offset 0
> This might be a lot quicker than pulling all the records like in
Yes, that certainly does seem to be the case - around 4 seconds. But I
need it to be 10 times faster (or thereabouts), otherwise I have big
problems.
*beats drum* Get it in RAM, Get it in RAM, ...
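One aside on the 20-row query quoted above: the schema lists
street_prefix as an index on lower("substring"(street, 1, 1)), while
the query filters on upper(...). Rewriting the predicate to match the
indexed expression exactly should let the planner consider that
functional index (a sketch, untested against the original data):

```sql
-- Matching the indexed expression lower(substring(street, 1, 1))
-- makes the street_prefix functional index usable:
SELECT street, locality_1, locality_2, city
FROM address
WHERE (city = 'Nottingham' OR locality_2 = 'Nottingham'
       OR locality_1 = 'Nottingham')
  AND lower(substring(street from 1 for 1)) = 'a'
GROUP BY street, locality_1, locality_2, city
ORDER BY street
LIMIT 20 OFFSET 0;
```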