At 05:06 PM 9/1/2005, Matthew Sackman wrote:
On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
> >     "address_city_index" btree (city)
> >     "address_county_index" btree (county)
> >     "address_locality_1_index" btree (locality_1)
> >     "address_locality_2_index" btree (locality_2)
> >     "address_pc_bottom_index" btree (postcode_bottom)
> >     "address_pc_middle_index" btree (postcode_middle)
> >     "address_pc_top_index" btree (postcode_top)
> >     "address_pc_top_middle_bottom_index" btree (postcode_top,
> >                              postcode_middle, postcode_bottom)
> >     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
> >     "address_postcode_index" btree (postcode)
> >     "address_property_type_index" btree (property_type)
> >     "address_street_index" btree (street)
> >     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
> Wow, that's quite a lof of indexes... but your problem isn't reported as
> being in insert/update/delete.

Hah, well now that you mention it. Basically, 100,000 rows come in in a
bulk import every month and the only way I can get it to complete in any
sane time frame at all is to drop the indexes, do the import and then
recreate the indexes. But that's something that I'm OK with -

FTR, this "drop the indexes, do <foo>, recreate the indexes" is Industry Standard Practice for bulk inserts/updates/deletes. Regardless of DB product used.

- the imports don't have to be that fast and whilst important, it's not *the*
critical path.  Selection from the database is, hence the indexes.

A DB _without_ indexes that fits into RAM during ordinary operation may actually be faster than a DB _with_ indexes that does not. Fitting the entire DB into RAM during ordinary operation if at all possible should be the first priority with a small data mine-like application such as you've described.

Also normalization is _not_ always a good thing for data mining like apps. Having most or everything you need in one place in a compact and regular format is usually more effective for data mines than "Nth Order Normal Form" optimization to the degree usually found in textbooks using OLTP-like examples.

Indexes are a complication used as a performance enhancing technique because without them the DB is not performing well enough. IME, it's usually better to get as much performance as one can from other aspects of design and _then_ start adding complications. Including indexes. Even if you fit the whole DB in RAM, you are very likely to need some indexes; but profile your performance first and then add indexes as needed rather than just adding them willy nilly early in the design process.

You said you had 1GB of RAM on the machine now. That clearly is inadequate to your desired performance given what you said about the DB. Crank that box to 4GB and tighten up your data structures. Then see where you are.

> > This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
> > SATA harddrive.
> 8.0 or 8.1 might help you some -- better (and more!) disks will probably help
> a _lot_.

Ok, I did try 8.0 when I started this and found that the server bind
parameters (both via DBD::Pg (with pg_prepare_server => 1) and via JDBC
(various versions I tried)) failed - the parameters were clearly not
being substituted. This was Postgresql 8.0 from Debian unstable. That
was a couple of weeks ago and I've not been back to check whether its
been fixed. Anyway, because of these problems I dropped back to 7.4.

Since I assume you are not going to run anything with the string "unstable" in its name in production (?!), why not try a decent production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a OS more representative of what you are likely (or at least what is safe...) to run in production?

> > Queries such as:
> >
> > select locality_2 from address where locality_2 = 'Manchester';
> >
> > are taking 14 seconds to complete, and this is only 2 years worth of
> > data - we will have up to 15 years (so over 15 million rows).
> As Tom pointed out; you're effectively doing random searches here, and using
> CLUSTER might help. Normalizing your data to get smaller rows (and avoid
> possibly costly string comparisons if your strcoll() is slow) will probably
> also help.

Ok, so you're saying that joining the address table into an address_city
table (the obvious normalization) will help here?

The locale settings in postgresql.conf all have en_GB and a \l shows
encoding of LATIN1. So I don't think I've set anything to UTF8 or such

> > I need to get to the stage where I can run queries such as:
> > select street, locality_1, locality_2, city from address
> > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> >        or locality_1 = 'Nottingham')
> >   and upper(substring(street from 1 for 1)) = 'A'
> > group by street, locality_1, locality_2, city
> > order by street
> > limit 20 offset 0
> This might be a lot quicker than pulling all the records like in your example
> queries...

Yes, that certainly does seem to be the case - around 4 seconds. But I
need it to be 10 times faster (or thereabouts) otherwise I have big

*beats drum* Get it in RAM, Get it in RAM, ...

Ron Peacetree

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Reply via email to