On Nov 1, 2010, at 9:15 AM, Dan Schaffer wrote:
> We have an application that needs to do bulk reads of ENTIRE Postgres tables
> very quickly (i.e. select * from table). We have observed that such
> sequential scans run roughly 20 times slower than observed raw disk
> reads (5 MB/s versus 100 MB/s). Part of this is due to the storage overhead
> we have observed in Postgres. In the example below, it takes 1 GB to store
> 350 MB of nominal data. However, even accounting for that, we would expect
> bulk read rates of about 35 MB/s.
>
> Observations using iostat and top during these bulk reads suggest that the
> queries are CPU bound, not I/O bound. In fact, repeating the queries yields
> similar response times. Presumably, if it were an I/O issue the response
> times would be much shorter the second time through, with the benefit of
> caching.
>
> We have tried these simple queries using psql, JDBC, PL/Java stored
> procedures, and libpq. In all cases the client code ran on the same box as
> the server. We have experimented with Postgres 8.1, 8.3, and 9.0.
>
> We also tried playing around with some of the server tuning parameters, such
> as shared_buffers, to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06
> EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following
> (there are no indexes):
>
>  Table "bulk_performance.counts"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i1     | integer |
>  i2     | integer |
>  i3     | integer |
>  i4     | integer |
>
> There are 22 million rows in this case.
>
> We HAVE observed that summation queries run considerably faster. In this
> case,
>
>   select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
> Our business logic does operations on the resulting data such that the
> output is several orders of magnitude smaller than the input. So we had
> hoped that by putting our business logic into stored procedures (and thus
> drastically reducing the amount of data flowing to the client) our
> throughput would go way up. This did not happen.
>
> So our questions are as follows:
>
> Is there any way, using stored procedures (maybe C code that calls SPI
> directly) or some other approach, to get close to the expected 35 MB/s
> doing these bulk reads? Or is this the price we have to pay for using SQL
> instead of some NoSQL solution? (We actually tried Tokyo Cabinet and found
> it to perform quite well. However, it does not measure up to Postgres in
> terms of replication, data interrogation, community support, acceptance,
> etc.)
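For reference, a minimal sketch of what "C code that calls SPI directly" could
look like is below. It performs the same server-side summation as the SQL
query in the message above, so only a single bigint crosses the client/server
boundary; the function name counts_sum_i1 and the shared-library path are only
placeholders, not anything from this thread.

#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(counts_sum_i1);

/*
 * Sum bulk_performance.counts.i1 entirely inside the backend via SPI,
 * returning just the aggregate to the caller.
 */
Datum
counts_sum_i1(PG_FUNCTION_ARGS)
{
    int64   total = 0;

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    if (SPI_execute("SELECT sum(i1) FROM bulk_performance.counts",
                    true /* read only */, 0 /* no row limit */) != SPI_OK_SELECT)
        elog(ERROR, "SPI_execute failed");

    if (SPI_processed > 0)
    {
        bool    isnull;
        Datum   d = SPI_getbinval(SPI_tuptable->vals[0],
                                  SPI_tuptable->tupdesc, 1, &isnull);

        if (!isnull)
            total = DatumGetInt64(d);
    }

    SPI_finish();

    PG_RETURN_INT64(total);
}

It would be registered with something like the following (the .so path is a
placeholder for wherever the module is installed):

CREATE FUNCTION counts_sum_i1() RETURNS bigint
    AS '/path/to/counts_spi.so', 'counts_sum_i1' LANGUAGE C STRICT;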
Have you by chance tried EXPLAIN ANALYZE SELECT * FROM bulk_performance.counts?
That throws away the query results, which takes client-server transfer out of
the picture.

Also, when you tested raw disk I/O, did you do it with an 8k block size? That's
the default size of a Postgres block, so all of its I/O is done that way.

What does iostat show you? Are you getting a decent number of read requests per
second?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
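For concreteness, the checks suggested above might look roughly like this at a
shell prompt; the database name, test file, and interval are placeholders, and
the test file should be larger than RAM so the page cache doesn't skew the
comparison.

# Time the scan server-side without shipping rows to a client
psql -d yourdb -c "EXPLAIN ANALYZE SELECT * FROM bulk_performance.counts"

# Raw sequential read at Postgres' 8k block size, then again with large blocks
dd if=/path/to/testfile of=/dev/null bs=8k
dd if=/path/to/testfile of=/dev/null bs=1M

# Extended per-device stats every 5 seconds while the scan runs;
# watch r/s (read requests per second) and the throughput columns
iostat -x 5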