>From: Brandon Black <[EMAIL PROTECTED]>
>Sent: Sep 12, 2005 5:04 PM
>To: pgsql-performance@postgresql.org
>Subject: [PERFORM] Performance considerations for very heavy INSERT traffic

>I'm in the process of developing an application which uses PostgreSQL for 
>data storage. Our database traffic is very atypical, and as a result it has 
>been rather challenging to figure out how to best tune PostgreSQL on what 
>development hardware we have, as well as to figure out exactly what we 
>should be evaluating and eventually buying for production hardware.

>The vast, overwhelming majority of our database traffic is pretty much a 
>non-stop stream of INSERTs filling up tables. It is akin to data 
>acquisition. Several thousand clients are sending once-per-minute updates 
>full of timestamped numerical data at our central server, which in turn 
>performs INSERTs into several distinct tables as part of the transaction for 
>that client. We're talking on the order of ~100 transactions per second, 
>each containing INSERTs to multiple tables (which contain only integer and 
>floating point columns and a timestamp column - the primary key (and only 
>index) is on a unique integer ID for the client and the timestamp). The 
>transaction load is spread evenly over time by having the clients send their 
>per-minute updates at random times rather than on the exact minute mark.

I have built two such systems.  TBF, neither used PostgreSQL.  OTOH, the 
principles are the same.

One perhaps non-obvious point: You definitely are going to want a way to adjust 
exactly when a specific client is sending its approximately once-per-minute 
update via functionality similar to adjtime().  Such functionality will be 
needed to smooth the traffic across the clients as much as possible over the 1 
minute polling period given the real-world vagracies of WAN connections.

Put your current active data acquisition table on its own RAID 10 array, and 
keep it _very_ small and simple in structure (see below for a specific 
suggestion regarding this).  If you must have indexes for this table, put them 
on a _different_ array.  Basically, you are going to treat the active table 
like you would an log file: you want as little HD head movement as possible so 
appending to the table is effectively sequential IO.

As in a log file, you will get better performance if you batch your updates to 
HD into reasonably sized chunks rather than writing to HD every INSERT.

The actual log file will have to be treated in the same way to avoid it 
becoming a performance bottleneck.

>There will of course be users using a web-based GUI to extract data from 
>these tables and display them in graphs and whatnot, but the SELECT query 
>traffic will always be considerably less frequent and intensive than the 
>incessant INSERTs, and it's not that big a deal if the large queries take a 
>little while to run.

More details here would be helpful.

>This data also expires - rows with timestamps older than X days will be 
>DELETEd periodically (once an hour or faster), such that the tables will 
>reach a relatively stable size (pg_autovacuum is handling vacuuming for now, 
>but considering our case, we're thinking of killing pg_autovacuum in favor 
>of having the periodic DELETE process also do a vacuum of affected tables 
>right after the DELETE, and then have it vacuum the other low traffic tables 
>once a day while it's at it).

Design Idea: split the data into tables where _at most_ the tables are of the 
size that all the data in the table expires at the same time and DROP the 
entire table rather than scanning a big table for deletes at the same time you 
want to do inserts to said.  Another appropriate size for these tables may be 
related to the chunk you want to write INSERTS to HD in.  

This will also have the happy side effect of breaking the data into smaller 
chunks that are more likely to be cached in their entirety when used. 

>There is an aggregation layer in place which proxies the inbound data from 
>the clients into a small(er) number of persistent postgresql backend 
>processes. Right now we're doing one aggregator per 128 clients (so instead 
>of 128 seperate database connections over the course of a minute for a small 
>transaction each, there is a single database backend that is constantly 
>committing transactions at a rate of ~ 2/second). At a test load of ~1,000 
>clients, we would have 8 aggregators running and 8 postgresql backends. 
>Testing has seemed to indicate we should aggregate even harder - the planned 
>production load is ~5,000 clients initially, but will grow to almost double 
>that in the not-too-distant future, and that would mean ~40 backends at 128 
>clients each initially. Even on 8 cpus, I'm betting 40 concurrent backends 
>doing 2 tps is much worse off than 10 backends doing 8 tps.

Experience has taught me that the above is not likely to be the proper 
architecture for this kind of application.

The best exact approach is dependent on The Details, but in general you want to 
optimize the amount of data sent from NIC to CPU per transfer (multiple small 
copies and lots of interrupts _kill_ system performance) and use a combined 
Threading and Event Queue model with Processor Affinity being used to optimize 
the NIC <-> CPU path for a given NIC.  Have each physical NIC+CPU Affinity set 
be hidden underneath an overall Virtual NIC+CPU abstraction for load balancing 
purposes.  The traffic thinks it's talking to one NIC attached to one CPU.

This architecture allowed for a 4P Xeon system (Compaq Proliant 8500s) to 
handle 250K+ simultaneous active web connections.  In 2000 using Windows 2K.  
As HW + SW has gotten better, it has scaled very well.  It's a very high 
performance model.

Such an architecture using AMD64 CPUs running a 64b Linux 2.6 distro now in 
late 2005 should _easily_ handle your projected demand with plenty of room for 

>Test hardware right now is a dual Opteron with 4G of ram, which we've barely 
>gotten 1,000 clients running against. Current disk hardware in testing is 
>whatever we could scrape together (4x 3-ware PCI hardware RAID controllers, 
>with 8 SATA drives in a RAID10 array off of each - aggregated up in a 4-way 
>stripe with linux md driver and then formatted as ext3 with an appropriate 
>stride parameter and data=writeback). Production will hopefully be a 4-8-way 
>Opteron, 16 or more G of RAM, and a fiberchannel hardware raid array or two 
>(~ 1TB available RAID10 storage) with 15krpm disks and battery-backed write 

For deployment, get a 2P or 4P 16 DIMM slot mainboard and stuff it with at 
least 16GB of RAM.  A 2P 16 DIMM slot mainboard like the IWill DK88 (Tyan and a 
few others also have such boards in the works) is IMHO your best bet.

IME, you will not need CPU for this application as much as you will need RAM 
first and HD IO bandwidth second.
Get a 1C2P AMD64 system and don't worry about CPU until you are CPU bound.  I 
highly doubt you will need more CPU than a 2C2P (4 cores total) based on say 
4400+'s, 4600+'s, or 4800+'s can provide.

>I know I haven't provided a whole lot of application-level detail here, but 
>does anyone have any general advice on tweaking postgresql to deal with a 
>very heavy load of concurrent and almost exclusively write-only 
>transactions? Increasing shared_buffers seems to always help, even out to 
>half of the dev box's ram (2G). A 100ms commit_delay seemed to help, but 
>tuning it (and _siblings) has been difficult. We're using 8.0 with the 
>default 8k blocksize, but are strongly considering both developing against 
>8.1 (seems it might handle the heavy concurrency better), and re-compiling 
>with 32k blocksize since our storage arrays will inevitably be using fairly 
>wide stripes. Any advice on any of this (other than drop the project while 
>you're still a little bit sane)?

Since you are doing more writing than reading, and those writes are going to be 
relatively small, you may not get as much out of block sizes larger than the 
average DB write.

My initial instinct on this point is that you should keep the size of the 
"chunk" the same from NIC to CPU to HD, and make said chunk as large as 

Hope this helps,
Ron Peacetree

PS.  Yes, I'm available under the proper circumstances for a consulting gig.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not

Reply via email to