On Nov 1, 2010, at 9:15 AM, Dan Schaffer wrote:
> We have an application that needs to do bulk reads of ENTIRE Postgres tables 
> very quickly (i.e. select * from table).  We have observed that such 
> sequential scans run roughly 20x slower than observed raw disk reads (5 MB/s 
> versus 100 MB/s).  Part of this is due to the storage overhead we have 
> observed in Postgres: in the example below, it takes 1 GB to store 350 MB of 
> nominal data.  Even allowing for that overhead, though, we would expect bulk 
> read rates of about 35 MB/s (100 MB/s * 350 MB / 1 GB), not 5 MB/s.
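> 
> For reference, the on-disk footprint can be checked with something along 
> these lines:
> 
>   SELECT pg_size_pretty(pg_total_relation_size('bulk_performance.counts'));
>   -- reports roughly 1 GB for ~350 MB of nominal data
>   -- (22 million rows x 4 integer columns x 4 bytes each)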
> 
> Observations using iostat and top during these bulk reads suggest that the 
> queries are CPU bound, not I/O bound.  In fact, repeating the queries yields 
> similar response times.  Presumably if it were an I/O issue the response 
> times would be much shorter the second time through with the benefit of 
> caching.
> 
> We have tried these simple queries using psql, JDBC, pl/java stored 
> procedures, and libpq.  In all cases the client code ran on the same box as 
> the server.
> We have experimented with Postgres 8.1, 8.3 and 9.0.
> 
> We also tried playing around with some of the server tuning parameters such 
> as shared_buffers to no avail.
> 
> Here is uname -a for a machine we have tested on:
> 
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 
> EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> A sample dataset that reproduces these results looks like the following 
> (there are no indexes):
> 
> Table "bulk_performance.counts"
> Column |  Type   | Modifiers
> --------+---------+-----------
> i1     | integer |
> i2     | integer |
> i3     | integer |
> i4     | integer |
> 
> There are 22 million rows in this case.
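> 
> A dataset of this shape can be generated with something along these lines 
> (illustrative only; any integer data will reproduce the layout):
> 
>   CREATE SCHEMA bulk_performance;
>   CREATE TABLE bulk_performance.counts (
>       i1 integer, i2 integer, i3 integer, i4 integer
>   );
>   -- ~22 million rows of arbitrary integer data, no indexes
>   INSERT INTO bulk_performance.counts
>       SELECT n, n, n, n FROM generate_series(1, 22000000) AS s(n);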
> 
> We HAVE observed that summation queries run considerably faster.  In this 
> case,
> 
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
> 
> runs at 35 MB/s.
> 
> 
> Our business logic does operations on the resulting data such that the output 
> is several orders of magnitude smaller than the input.  So we had hoped that 
> by putting our business logic into stored procedures (and thus drastically 
> reducing the amount of data flowing to the client) our throughput would go 
> way up.  This did not happen.
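> 
> As a simplified stand-in for that kind of stored procedure (the name and 
> body here are just placeholders; the point is the shape: scan everything, 
> return almost nothing):
> 
>   CREATE OR REPLACE FUNCTION bulk_performance.summarize() RETURNS bigint AS $$
>   DECLARE
>       total bigint := 0;
>       r     bulk_performance.counts%ROWTYPE;
>   BEGIN
>       -- walk the whole table, reduce it to a single value
>       FOR r IN SELECT * FROM bulk_performance.counts LOOP
>           total := total + r.i1 + r.i2 + r.i3 + r.i4;
>       END LOOP;
>       RETURN total;
>   END;
>   $$ LANGUAGE plpgsql;
> 
>   SELECT bulk_performance.summarize();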
> 
> So our questions are as follows:
> 
> Is there any way, using stored procedures (maybe C code that calls SPI 
> directly) or some other approach, to get close to the expected 35 MB/s on 
> these bulk reads?  Or is this the price we have to pay for using SQL instead 
> of some NoSQL solution?  (We actually tried Tokyo Cabinet and found that it 
> performs quite well; however, it does not measure up to Postgres in terms of 
> replication, data interrogation, community support, acceptance, etc.)

Have you by chance tried EXPLAIN ANALYZE SELECT * FROM bulk_performance.counts? 
That will throw away the query results, which removes client-server 
considerations.
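
For example:

  EXPLAIN ANALYZE SELECT * FROM bulk_performance.counts;

The actual time reported on the Seq Scan node is the time to read the table 
with no result set being converted to text or sent to the client. (EXPLAIN 
ANALYZE does add some per-row timing overhead of its own, so the reported 
time can overstate the raw scan cost somewhat.)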

Also, when you tested raw disk IO, did you do it with an 8k block size? That's 
the default size of a Postgres block, so all of its IO is done that way.

What does iostat show you? Are you getting a decent number of read 
requests/second?
--
Jim C. Nasby, Database Architect                   j...@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


