Patrick Vedrines wrote:
Hi everyone,

I'm developing a decision-support web application based on:
- Red Hat 3 ES
- PostgreSQL 8.0.1
- Dell PowerEdge 2850, 2 GB RAM, 2 CPUs at 3 GHz, 1 MB
cache, and 4 ext3 disks at 10,000 rpm. I am the only user on the box
and there are no cron jobs running.

I have 2 databases (A and B) with exactly the same schema: -one main
table called "aggregate", having no indexes and supporting only SELECT
statements (loaded once a month with a new batch of data).

Perhaps look into clustering the tables.

-Row size is about 200 bytes (50 columns of type char(x) or integer)
-and several small 'reference' tables, not shown in the following
example for clarity. -Database A: aggregate contains 2,300,000
records (~500 MB) -Database B: aggregate contains 9,000,000 records
(~2 GB)

There is no index on the aggregate table, since the criteria, their
number, and their scope are freely chosen by the customers.

Hmm... not convinced this is a good idea.

The query:
  select sum(ca) from aggregate
  where (issue_date >= '2004-01' and issue_date <= '2004-02');
takes 5 s on database A (5 min 30 s the first time, probably to fill
the cache) and 21 min on database B (whether or not it is the first
time).

Because A fits in the cache and B doesn't.
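If it helps to confirm that, running the query under EXPLAIN ANALYZE
should report a Seq Scan over the whole "aggregate" table (this is a
sketch of how to check, not output from this system):

  -- Show the plan and actual timings; with no usable index,
  -- the plan should contain a Seq Scan on aggregate.
  EXPLAIN ANALYZE
  SELECT sum(ca)
  FROM aggregate
  WHERE issue_date >= '2004-01'
    AND issue_date <= '2004-02';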

- I guess this is a cache issue, but how can I manage/control it?
Does Postgres manage its own cache or does it use the OS cache?


- Is using the cache a good approach? It does not seem to work for
large databases: I tried several different values in postgresql.conf
and /proc/sys/kernel/shmmax without seeing any improvement in
response time (for example: shared_buffers = 190000, sort_mem = 4096,
effective_cache_size = 37000 and kernel/shmmax = 1200000000). Do I
have to upgrade the RAM to 6 GB and/or buy faster disks (of what
type)? Moreover, a query on database B will destroy the cache
previously built for database A, increasing the response time of the
next query on database A. And I in fact have 15 databases!

If you don't have any indexes and the table isn't clustered then PG has no choice but to scan the entire table for every query. As you note, that's going to destroy your cache. You can increase the RAM but sooner or later, you'll get the same problem.

- In my case, what would be the best combination of parameters
between postgresql.conf and /proc/sys/kernel/shmmax?
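For what it's worth, common advice for 8.0 on a 2 GB box is to keep
shared_buffers fairly modest and let the OS page cache do most of the
work, while telling the planner about that cache via
effective_cache_size. The numbers below are a hypothetical starting
point, not tuned values for this workload:

  # postgresql.conf sketch for a 2 GB machine (values are assumptions)
  shared_buffers = 25000          # ~200 MB; 8.0 rarely benefits from much more
  sort_mem = 4096                 # per-sort work memory, in KB
  effective_cache_size = 150000   # ~1.2 GB; approximate size of the OS cache

/proc/sys/kernel/shmmax then only needs to be at least as large as
the shared memory segment (here roughly 25000 x 8 kB, so ~210 MB),
not the 1.2 GB you had set.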

- Is there a way to reduce the size of the "aggregate" table files
(1 GB + 1 GB + 1 GB + 0.8 GB = 3.8 GB for the "aggregate" table,
instead of the expected 2 GB = 200 bytes * 9,000,000 records) by
playing with the data types or other parameters (fillfactor?)?
VACUUM (even FULL) seems useless, since the aggregate table only
ever sees 'copy aggregate from' and 'select'.

You can replace int4 with int2 and so on (where possible) but that will only delay problems.
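For example (the column names below are made up for illustration;
8.0's ALTER TABLE ... TYPE makes the change easy, though it rewrites
the table):

  -- Narrow an int4 column that only ever holds small values:
  ALTER TABLE aggregate ALTER COLUMN units_sold TYPE int2;
  -- char(n) is blank-padded to n characters; varchar(n) stores
  -- only what is actually used:
  ALTER TABLE aggregate ALTER COLUMN region_code TYPE varchar(10);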

- Is it possible to define a sort of RAM filesystem (as exists in
DOS/Windows) on which I could create and populate my databases?
...since the databases do not support updates in this application.

Won't help - your cache is already doing that. Some things you can do (in order of effort):

1. Cluster the large tables
2. Analyse your customers' queries and try a couple of indexes - some choices will be more common than others.
3. Split your tables into two - common fields, uncommon fields, that way filtering on the common fields might take less space.
4. Split your tables by date, one table per month or year. Then re-write your customers' queries on-the-fly to select from the right table. Will only help with queries on date of course.
5. Place each database on its own machine or virtual machine so they don't interfere with each other.
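Items 1 and 2 together might look like this, assuming issue_date is
the customers' most common filter (the index name is made up):

  -- An index on the most commonly filtered column:
  CREATE INDEX aggregate_issue_date_idx ON aggregate (issue_date);
  -- Physically order the table by that index (8.0 syntax). CLUSTER
  -- is a one-off operation, so repeat it after each monthly reload:
  CLUSTER aggregate_issue_date_idx ON aggregate;
  ANALYZE aggregate;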

I'd start with items 1 and 2 and see if that helps, though.

PS - it might make sense to have an unusually large shared_buffers setting for PG, but I'm not familiar enough with the changes in the cache handling in 8.0 to say for sure.
PPS - there are more changes coming for 8.1, but I know even less about those.

  Richard Huxton
  Archonet Ltd
