We're in the process of setting up a new database server. The
application is an online rss aggregator which you can see at
www.fastbuzz.com (still running with the old hardware).

The new machine is a dual Xeon with 2 Gigs of ram 

The OS is freebsd 4.9. 

shared_buffers = 10000
sort_mem = 32768
effective_cache_size = 25520  -- freebsd forumla: vfs.hibufspace / 8192

1. While it seems to work correctly, I'm unclear on why this number is
correct.  25520*8 = 204160 or 200 Megs. On a machine with 2 Gigs it
seems like the number should be more like 1 - 1.5 Gigs.

2.  The main performance challenges are with the items table which has around
five million rows and grows at the rate of more than 100,000 rows a day.

If I do a select count(*) from the items table it take 55 - 60 seconds
to execute. I find it interesting that it takes that long whether it's
doing it the first time and fetching the pages from disk or on
subsequent request where it fetches the pages from memory.
I know that it's not touching the disks because I'm running an iostat in
a different window. Here's the explain analyze:

explain analyze select count(*) from items;
                                                        QUERY PLAN
 Aggregate  (cost=245377.53..245377.53 rows=1 width=0) (actual 
time=55246.035..55246.040 rows=1 loops=1)
   ->  Seq Scan on items  (cost=0.00..233100.62 rows=4910762 width=0)
(actual time=0.054..30220.641 rows=4910762 loops=1)
 Total runtime: 55246.129 ms
(3 rows)

and the number of pages:

select relpages from pg_class where relname = 'items';

So does it make sense that it would take close to a minute to count the 5 million rows
even if all pages are in memory? 

3. Relpages is 183993 so file size should be  183993*8192 = 1507270656,
roughly 1.5 gig. The actual disk size is 1073741824 or roughly 1 gig.
Why the difference?

4. If I put a certain filter/condition on the query it tells me that it's doing
a sequential scan, and yet it takes less time than a full sequential

explain analyze select count(*) from items where channel < 5000;
                                                        QUERY PLAN
 Aggregate  (cost=249141.54..249141.54 rows=1 width=0) (actual 
time=26224.603..26224.608 rows=1 loops=1)
   ->  Seq Scan on items  (cost=0.00..245377.52 rows=1505605 width=0) (actual 
time=7.599..17686.869 rows=1632057 loops=1)
         Filter: (channel < 5000)
 Total runtime: 26224.703 ms

How can it do a sequential scan and apply a filter to it in less time
than the full sequential scan? Is it actually using an index without
really telling me? 

Here's the structure of the items table

    Column     |           Type           | Modifiers
 articlenumber | integer                  | not null
 channel       | integer                  | not null
 title         | character varying        |
 link          | character varying        |
 description   | character varying        |
 comments      | character varying(500)   |
 dtstamp       | timestamp with time zone |
 signature     | character varying(32)    |
 pubdate       | timestamp with time zone |
    "item_channel_link" btree (channel, link)
    "item_created" btree (dtstamp)
    "item_signature" btree (signature)
    "items_channel_article" btree (channel, articlenumber)
    "items_channel_tstamp" btree (channel, dtstamp)

5. Any other comments/suggestions on the above setup.



Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to