> ________________________________________
> From: [EMAIL PROTECTED] [EMAIL PROTECTED] On Behalf Of Jean-David Beyer
> [EMAIL PROTECTED]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing

> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things. Of
course, we aren't talking about the highest-performance databases if we are
talking about mixed-use systems.
Additionally, the OS can never really get it right, whether it's a DB or other
apps. Any app can behave badly, grab too much RAM, and touch it just often
enough that it never looks 'idle', giving the OS VM fits as it tries to figure
out how important those pages are relative to other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle': the OS has to decide which
pages to evict when all of them have been accessed recently. Most OSes make bad
choices when the load mixes random and sequential access, because they treat
all pages equally with respect to freshness versus eviction.
Another problem is that there IS a difference between being in postgres' cache
and being in the OS cache: one is significantly more expensive to retrieve
from than the other.

Accessing buffers that are already in shared_buffers, in process, uses a good
chunk less CPU than going through a syscall to the OS; it avoids the data copy
and the shared-buffer eviction overhead.
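To put a rough number on that on your own hardware, a toy test like the one
below works (my sketch, not anything from the 8.4 work; the file name and
sizes are made up, and "testfile.dat" is assumed to exist, be about 1GB, and
end up fully cached after the warm-up loop):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   8192              /* stand-in for the postgres block_size */
#define NBLOCKS (128 * 1024)      /* ~1GB total */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *local = malloc((size_t)BLOCK * NBLOCKS);   /* stand-in for shared_buffers */
    char *scratch = malloc(BLOCK);
    volatile char sink = 0;                          /* keeps the copies from being optimized away */
    int fd = open("testfile.dat", O_RDONLY);         /* assumed to already exist */
    if (fd < 0 || !local || !scratch) { perror("setup"); return 1; }

    /* Warm both caches: pull the whole file into the local buffer once,
     * which also leaves it in the OS page cache. */
    for (long i = 0; i < NBLOCKS; i++)
        pread(fd, local + (size_t)i * BLOCK, BLOCK, (off_t)i * BLOCK);

    double t0 = now();
    for (long i = 0; i < NBLOCKS; i++) {             /* in-process: plain memory copy */
        memcpy(scratch, local + (size_t)i * BLOCK, BLOCK);
        sink += scratch[0];
    }
    double t1 = now();
    for (long i = 0; i < NBLOCKS; i++) {             /* via the OS: one syscall + copy per block */
        pread(fd, scratch, BLOCK, (off_t)i * BLOCK);
        sink += scratch[0];
    }
    double t2 = now();

    printf("local copy: %.2fs   pread from page cache: %.2fs\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}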

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is still
issued in block_size chunks. (Hopefully I am wrong.)
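For reference, the kind of call that work is built on looks roughly like the
sketch below; this is just generic posix_fadvise prefetching, not the actual
patch, and the helper name is mine:

#include <fcntl.h>

/* Hint the kernel to start reading a range ahead of time; the reads that
 * follow can still be issued one block_size chunk at a time. */
static void prefetch_range(int fd, off_t start_block, int nblocks, int block_size)
{
    posix_fadvise(fd, start_block * (off_t)block_size,
                  (off_t)nblocks * block_size, POSIX_FADV_WILLNEED);
}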
My system is now CPU bound: the I/O can do sequential reads at more than
1.2GB/sec, but Postgres can't do a seqscan at even 30% of that because it eats
up CPU like crazy just reading and identifying tuples. It does seqscans about
25% faster from shared_buffers than from the OS's page cache, though. Seqscans
peak between 250MB/sec and 400MB/sec, whether from memory or disk, typically
showing no more than 35% iostat utilization of the array when they come off
disk -- so we run a few concurrently where we can.
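(If anyone wants to reproduce the raw sequential rate side of that comparison
without postgres in the loop, a large-block read loop like this rough sketch
will do; the path is a placeholder, and the file needs to be bigger than RAM,
or the page cache dropped first, to measure the disks rather than memory:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 1 << 20;                 /* 1MB reads */
    char *buf = malloc(chunk);
    int fd = open("/data/bigfile", O_RDONLY);     /* placeholder path */
    if (fd < 0 || !buf) { perror("setup"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)        /* plain sequential read loop */
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f MB/sec sequential\n", total / 1048576.0 / secs);
    close(fd);
    return 0;
}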

In addition to the fadvise patch, postgres needs to merge adjacent I/Os into
larger ones to reduce the overhead. It only really needs to merge up to sizes
of about 128k or 256k to gain an 8x to 16x drop in syscall overhead, and it
could additionally save some trips down the shared buffer management code
paths. At least, that's my guess; I haven't looked at any code and could be
wrong.
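To be concrete about what I mean by merging (a hypothetical sketch of the
idea, not postgres code; the names and cap are mine): when the blocks a scan
wants are adjacent on disk, issue one larger pread() instead of one syscall
per block.

#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 8192
#define MERGE_CAP  (256 * 1024)   /* merge adjacent requests up to 256k */

/* Read 'count' logically adjacent blocks starting at 'first_block', issuing
 * as few pread() calls as possible.  Returns 0 on success, -1 on error. */
static int read_adjacent_blocks(int fd, long first_block, int count, char *dest)
{
    int max_per_call = MERGE_CAP / BLOCK_SIZE;

    while (count > 0)
    {
        int n = count < max_per_call ? count : max_per_call;
        ssize_t want = (ssize_t)n * BLOCK_SIZE;
        ssize_t got = pread(fd, dest, want, (off_t)first_block * BLOCK_SIZE);

        if (got != want)
            return -1;            /* short read or error; a real version would retry */

        first_block += n;
        dest += want;
        count -= n;
    }
    return 0;
}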


Additionally, the "If your operating system has any reasonable caching itself" 
comment earlier in this conversation ---  Linux (2.6.18, Centos 5.2) does NOT.  
I can easily make it spend 100% CPU in system time trying to figure out what to 
do with the system cache for an hour.  Just do large seqscans with memory 
pressure from work_mem or other forces that the OS will not deem 'idle'.  Once 
the requested memory is ~75% of the system total, it will freak out.  Linux 
simply will not give up that last 25% or so of the RAM for anything but page 
cache, even though the disk subsustem is very fast and most of the access is 
sequential, marginalizing the benefit of the cache.  Depending on how you tune 
it, it will either spin system cpu or swap storm, but the system cpu spin times 
for the same work load are a lot shorter than an equivalent swap storm.
Mount the data drive in O_DIRECT and the problem vanishes.  I've been told that 
this problem may be gone in some of the latest kernels.  I have seriously 
considered bumping shared_buffers up a lot and mounting the thing direct -- but 
then we lose the useful scheduler and readahead algorithms.  The other way 
around (small shared_buffers, let the OS do it) hurts performance overall quite 
a bit -- randomly accessed pages get pushed out to the OS cache more often, and 
the OS tosses thouse out when a big seqscan occurs, resulting in a lot more 
random access from disk and more disk bound periods of time. Great wonder, this 
operating system caching, eh?
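For concreteness, this is roughly what direct I/O looks like at the
application level on Linux (a minimal sketch; the path and sizes are
placeholders): the O_DIRECT flag bypasses the page cache entirely, but the
buffer, length, and file offset all have to be block-aligned.

#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096                /* buffer, length, and offset must be aligned */

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0)
        return 1;

    /* O_DIRECT skips the OS page cache: no double caching and no cache
     * management CPU burn, but also no kernel readahead or write coalescing. */
    int fd = open("/data/pgdata/somefile", O_RDONLY | O_DIRECT);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t got = pread(fd, buf, ALIGN, 0);
    printf("read %zd bytes directly from disk\n", got);

    close(fd);
    free(buf);
    return 0;
}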

In any event, don't hold up these OS page caches as if they're the best thing
in the world for a database. They have serious flaws of their own and are
typically difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and
why bother?); it's another for a database to manage its own page cache
intelligently and access the OS file system as optimally as it can. In either
case, the DB knows much more than the OS about which data is really important.
It can, for example, prioritize cache versus work_mem intelligently (the OS
can get that one horribly wrong in my experience), and it knows when a huge
seqscan occurs, so it can make caching those results a low priority. No matter
how you do it, using the OS cache means you cache twice and copy twice.
O_DIRECT usually isn't an option for other reasons: the OS disk scheduler,
readahead, and other benefits of a file system are real and substantial. And
if you are caching twice, you might as well have the "closer" copy of that
data be the larger, more efficient pool.
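That seqscan policy is easy to express inside the DB and nearly impossible to
get out of a generic OS LRU. A toy sketch of what I mean (my own
simplification, loosely in the spirit of a clock sweep with usage counts, not
actual postgres code): pages pulled in by a big seqscan earn no 'usage
credit', so they are the first to be reused and a scan can't flush everything
else.

#include <stdbool.h>

#define NBUFFERS  1024
#define MAX_USAGE 5

struct buffer {
    long page;        /* which disk page this buffer holds (-1 = empty) */
    int  usage;       /* decremented by the clock sweep, bumped on access */
};

static struct buffer pool[NBUFFERS];
static int clock_hand = 0;

/* Record an access.  Random/index access earns usage credit so the page
 * sticks around; a large seqscan gets none, so its pages are recycled first. */
static void note_access(struct buffer *b, bool part_of_big_seqscan)
{
    if (!part_of_big_seqscan && b->usage < MAX_USAGE)
        b->usage++;
}

/* Clock sweep: decrement usage counts until a buffer hits zero and take it.
 * Seqscan pages (usage 0) become victims almost immediately. */
static struct buffer *evict_one(void)
{
    for (;;)
    {
        struct buffer *b = &pool[clock_hand];
        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (b->usage == 0)
            return b;
        b->usage--;
    }
}

I believe recent postgres versions already do something in this spirit for
large seqscans (a small ring of buffers), which is part of why shared_buffers
behaves better than the OS cache here.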

As for tipping points and pgbench: it doesn't seem to reflect the kind of
workload we use postgres for at all. My workload does a lot of big hashes and
seqscans, and I'm curious how much those may improve due to the hash
improvements. We have 32GB RAM and 3TB of data (about 250GB scanned
regularly), and yes, we are almost completely CPU bound now except for a few
tasks. Iostat only reports above 65% disk utilization for about 5% of the
workload duty-cycle and is regularly < 20%. COPY doesn't get anywhere near
platter speeds, even on indexless bulk transfer. The highest disk usage spikes
occur when some of our random-access data/indexes get shoved out of cache.
These aren't too large, but a high enough seqscan load will cause postgres and
the OS to dump them from cache. If we put these on some SSDs, the disk
utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a
medium-sized hardware budget ($25k) for almost any postgres setup and make it
almost purely CPU bound.
SSDs and hybrid tech such as ZFS's L2ARC make it easy to get 10k+ iops, and
next year it will take no more than 12 SATA drives in RAID 10 (and a good
controller or software RAID) to get 1GB/sec sequential reads.
