> On 06/28/2011 05:28 PM, Craig McIlwee wrote:
> > Autovacuum is disabled for these tables since the data is never 
> > updated.  The tables that we are testing with at the moment will not 
> > grow any larger and have been both clustered and analyzed.
> 
> Note that any such prep to keep from ever needing to maintain these 
> tables in the future should include the FREEZE option, possibly with 
> some parameters tweaked first to make it more aggressive.  Autovacuum 
> will eventually revisit them in order to prevent transaction ID 
> wrap-around, even if it's disabled.  If you're going to the trouble of 
> prepping them so they are never touched again, you should do a freeze 
> with the right parameters so that anti-wraparound vacuum never needs to 
> touch them.
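
For each daily table, I assume that means something along these lines (the 
table name is made up, and the per-session settings are my guess at "more 
aggressive" -- they force a full-table scan and freeze every tuple):

    SET vacuum_freeze_min_age = 0;
    SET vacuum_freeze_table_age = 0;
    VACUUM FREEZE VERBOSE ANALYZE daily_20110628;
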
> 
> > work_mem: 512MB
> > shared_buffers: 64MB, 512MB, and 1024MB, each yielded the same query 
> > plan and took the same amount of time to execute give or take a few 
> > seconds
> 
> shared_buffers doesn't normally impact the query plan; it impacts how 
> much churn there is between the database and the operating system cache, 
> mainly important for making write-heavy work efficient.  On Windows, 
> you'll probably be safe to set this to 512MB and forget about it.  It 
> doesn't benefit from large values anyway.

I was thinking that shared_buffers controlled the amount of data, primarily 
table and index pages, that the database could keep in memory at once.  Based 
on that assumption, I thought that a larger value would let an entire table 
and its index stay in memory together and speed up the query.  Am I wrong?
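
Either way, is there an easy way to see how much of a table is actually 
staying cached?  I was thinking of something along these lines with the 
pg_buffercache contrib module (just a sketch; it assumes the module is 
installed and the default 8KB block size):

    SELECT c.relname,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS cached
    FROM pg_buffercache b
    JOIN pg_class c ON c.relfilenode = b.relfilenode
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;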

> 
> This is a very large work_mem setting however, so be careful that you 
> won't have many users connecting at once if you're going to use it.  
> Each connection can use a multiple of work_mem, making it quite possible 
> you could run out of memory with this configuration.  If that low user 
> count is true, you may want to make sure you're enforcing it by lowering 
> max_connections, as a safety measure to prevent problems.

I plan on lowering this quite a bit since I haven't seen much of a boost by 
increasing it.
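
To put numbers on it: at 512MB, a query that uses a couple of sort or hash 
steps across even a handful of connections could already add up to several 
GB, so I'll probably end up with something more like this (the exact values 
are just a guess at this point):

    -- postgresql.conf: modest global default plus a hard cap on connections
    --   work_mem = 32MB
    --   max_connections = 40
    -- then raise it only for the big report queries, per session:
    SET work_mem = '256MB';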

> 
> > Since the daily tables are only 360MB, I would hope that the entire 
> > table could be pulled into memory with one large sequential read.  Of 
> > course this assumes that the file pieces are stored contiguously, but 
> > auto defrag is enabled and shows low fragmentation, so I'm trusting (as 
> > much as one can) Windows to do the right thing here.  My drives have a 
> > 150MB/s sustained max throughput, and considering that data is spread 
> > across 5 drives I would hope to at least be able to reach the single 
> > disk theoretical limit and read an entire table plus the index into 
> > memory in about 4 to 5 seconds.  Based on the analyze output, each daily 
> > table averages 6 to 7 seconds, so I'm pretty close there and maybe just 
> > limited by disk speed?
> 
> One thing to note is that drive speed varies depending on which part of 
> the disk the data is located on; the slower parts of the drive will 
> deliver much less than 150MB/s.
> 
> On Linux servers it's impossible to get close to the disk's raw 
> sequential speed without making the operating system's read-ahead 
> feature much more aggressive than it is by default.  Because PostgreSQL 
> fetches a single block at a time, to keep the drive completely busy 
> something has to notice the pattern of access and read data ahead of 
> when the database even asks for it.  You may find a parameter you can 
> tune in the properties for the drives somewhere in the Windows Control 
> Panel.  And there's a read-ahead setting on your PERC card, which you 
> may not have turned on, that's better than nothing (not as good as the 
> Linux one, but it's useful).  There are two useful settings there ("on" 
> and "adaptive" if I recall correctly) that you can try, to see which 
> works better.

Looks like they are set to adaptive read-ahead now.  If the database is 
executing many concurrent queries, is it reasonable to suspect that the IO 
requests will compete with each other, so that the controller rarely sees 
enough sequential requests from any one process to trigger read-ahead?  The 
controller also has an 'on' option that forces read-ahead; maybe that would 
solve the issue, provided the prefetched data survives in the cache until 
the actual read request arrives.
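
As a sanity check on the throughput side, I can time a bare sequential scan 
and divide the table size by the elapsed time (table name made up; this is 
only meaningful when the table isn't already cached):

    \timing on
    SELECT pg_size_pretty(pg_relation_size('daily_20110628'));
    SELECT count(*) FROM daily_20110628;   -- plain sequential scan
    -- ~360MB divided by the elapsed time gives the effective read rate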

> 
> > I've read documentation that says I should be able to set statistics 
> > values for an entire table as opposed to per column, but haven't found 
> > how to do that.  I guess I was either too lazy to update statistics on 
> > each column or just didn't think it would help much.
> 
> You can adjust the statistics target across the entire database using 
> the default_statistics_target setting, or you can tweak them per column 
> using ALTER TABLE.  There is no table-level control.  I find it 
> difficult to answer questions about whether the statistics are adequate 
> without actually looking at pg_stats to see how the database is 
> interpreting the data, and comparing that against the real distribution.  
> This is an area where flailing about trying things doesn't work very 
> well; you need to be very systematic about the analysis and testing 
> strategy if you're going to get anywhere useful.  It's not easy to do.
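
Understood.  If I do go down that road, I gather the per-column route looks 
something like this (the table and column names are just examples):

    ALTER TABLE daily_20110628 ALTER COLUMN link_id SET STATISTICS 1000;
    ANALYZE daily_20110628;

    -- then sanity-check the estimates against the real distribution
    SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'daily_20110628';
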
> 
> As a larger commentary on what you're trying to do, applications like 
> this often find themselves at a point one day where you just can't allow 
> arbitrary user queries to run against them anymore.  What normally 
> happens then is that the most common things that people really need end 
> up being run once and stored in some summary form, using techniques such 
> as materialized views:  http://wiki.postgresql.org/wiki/Materialized_Views
> 
> In your case, I would start now on trying to find the common patterns in 
> the long-running reports that people generate, and see if it's possible 
> to pre-compute some portion of them and save that summary.  And you may 
> find yourself in a continuous battle with business requests regardless.  
> It's often key decision makers who feel they should be able to query any 
> way they want, regardless of its impact on the database.  Finding a 
> middle ground there is usually challenging.
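
That's fair.  A hand-rolled summary table is probably where we'd start, 
something roughly like this (the columns are placeholders since the real 
reports vary):

    CREATE TABLE daily_summary AS
    SELECT some_key,
           date_trunc('hour', event_time) AS hour,
           count(*) AS row_count,
           sum(some_value) AS total_value
    FROM daily_20110628
    GROUP BY some_key, date_trunc('hour', event_time);

    -- rebuilt (TRUNCATE + INSERT ... SELECT) after each daily load
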
> 
> -- 
> Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
> Comprehensive and Customized PostgreSQL Training Classes:
> http://www.2ndquadrant.us/postgresql-training/
> 

Thanks,
Craig

