On 11/18/05 10:30 AM, "Alan Stange" <[EMAIL PROTECTED]> wrote:
> Actually, this was dual cpu and there was other activity during the full
> minute, but it was on other file devices, which I didn't include in the
> above output. Given that, and given what I see on the box now I'd
> raise the 20% to 30% just to be more conservative. It's all in the
> kernel either way; using a different scheduler or file system would
> change that result. Even better would be using direct IO to not flush
> everything else from memory and avoid some memory copies from kernel to
> user space. Note that almost none of the time is user time. Changing
> postgresql won't change the CPU usage.
These are all things that may help on the I/O wait side; however, there is a
producer/consumer problem in postgres that goes something like this:
- Read some (small number of, sometimes 1) 8k pages
- Do some work on those pages, including lots of copies
This back-and-forth, without threading (like AIO, or a multiprocessing
executor), causes cycling and inefficiency that limit throughput.
Optimizing some of the memcopies and other garbage out, plus increasing the
internal (postgres) readahead, would probably double the disk bandwidth.
But being disk-bound (meaning that the disk subsystem is running at full
speed) requires asynchronous I/O. We do this now with Bizgres MPP, and we
get fully saturated disk channels on every machine. That means that even on
one machine, we run many times faster than non-MPP postgres.
> One IMHO obvious improvement would be to have vacuum and analyze only do
> direct IO. Now they appear to be very effective memory flushing tools.
> Table scans on tables larger than say 4x memory should probably also use
> direct IO for reads.
That's been suggested many times before - I agree, but it also needs AIO to
be maximally effective.
> I don't know what the system cost. It was part of block of dual
> opterons from Sun that we got some time ago. I think the 130MB/s is
> slow given the hardware, but it's acceptable. I'm not too price
> sensitive; I care much more about reliability, uptime, etc.
Then I know what they cost - we have them too (V20z and V40z). You should
be getting 400MB/s+ with external RAID.
>>> What am I doing wrong?
>>> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
>>> (for a DOE lab). And now I don't know what I'm doing,
>> Cool. Would that be Sandia?
>> We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
>> complex queries.
> Disk?! 4 StorageTek tape silos. That would be 0.002 TB/s. One has to
> change how you think when you have that much data. And hope you don't
> have a fire, because there's no backup. That work was while I was at
> BNL. I believe they are now at 4PB of tape and 150TB of disk.
We had 1.5 Petabytes on 2 STK Silos at NAVO from 1996-1998 where I ran R&D.
We also had a Cray T932, an SGI Origin 3000 with 256 CPUs, a Cray T3E with
1280 CPUs, 2 Cray J916s with 1 TB of shared disk, a Cray C90-16, a Sun E10K,
etc etc, along with clusters of Alpha machines and lots of SGIs. It's nice
to work with a $40M annual budget.
Later, working with FSL we implemented a weather forecasting cluster that
ultimately became the #5 fastest computer on the TOP500 supercomputing list
from 512 Alpha cluster nodes. That machine had a 10-way shared SAN, tape
robotics and a Myrinet interconnect and ran 64-bit Linux (in 1998).