On Mon, 2008-09-08 at 09:16 -0700, Matt Smiley wrote:
> Hi Kiran,
> 
> You gave great info on your problem.
> 
> First, is this the query you're actually trying to speed up, or is it a 
> simplified version?  It looks like the optimizer has already chosen the best 
> execution plan for the given query.  Since the query has no joins, we only 
> have to consider access paths.  You're fetching 58221/37909009 = 0.15% of the 
> rows, so a sequential scan is clearly inappropriate.  A basic index scan is 
> likely to incur extra scattered I/O, so a bitmap index scan is favored.

Thanks for your analysis and sorry for the long silence. 

It's a simplified version. I was tackling this part of the original query
plan since I noticed I was getting inaccurate stats on one of the tables.

> 
> To improve on this query's runtime, you could try any of the following:
> 
>  - Reorganize the data to reduce this query's scattered I/O (i.e. cluster on 
> "paliasorigin_search3_idx" rather than "paliasorigin_alias_casefold_idx").  
> Bear in mind, this may adversely affect other queries.

I applied this to a different table, which solved my original problem!
The query was hitting statement_timeout but now runs in reasonable
time. I re-clustered one of the tables in my actual query on a more
appropriate index.
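For the archives, the re-clustering step above can be sketched like this
(table and index names taken from the thread; the CLUSTER ... USING form
assumes PostgreSQL 8.3 or later):

```sql
-- Sketch only: rewrite the table in the physical order of the chosen index.
-- CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the whole table.
CLUSTER paliasorigin USING paliasorigin_search3_idx;

-- Re-analyze so the planner sees the new physical correlation.
ANALYZE paliasorigin;
```

Note that CLUSTER is a one-time reorganization; new rows are not kept in
index order, so it may need to be repeated periodically.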

> 
>  - Increase the cache hit frequency by ensuring the underlying filesystem 
> cache has plenty of RAM (usually so under Linux) and checking that other 
> concurrent queries aren't polluting the cache.  Consider adding RAM if you 
> think the working set of blocks required by most queries is larger than the 
> combined Postgres and filesystem caches.  If other processes than the db do 
> I/O on this machine, consider them as resource consumers, too.
> 
>  - Restructure the table, partitioning along a column that would be useful 
> for pruning whole partitions for your painful queries.  In this case, 
> origin_id or tax_id seems like a good bet, but again, consider other queries 
> against this table.  38 million rows probably makes your table around 2 GB 
> (guessing about 55 bytes/row).  Depending on the size and growth rate of the 
> table, it may be time to consider partitioning.  Out of curiosity, what 
> runtime are you typically seeing from this query?  The explain-analyze ran in 
> 113 ms, which I'm guessing is the effect of caching, not the runtime you're 
> trying to improve.

This seems inevitable eventually, if my tables keep growing in size.
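On the PostgreSQL versions current at the time, partitioning is done with
table inheritance plus CHECK constraints. A minimal sketch, assuming tax_id
as the partition key (child table names and the split are illustrative):

```sql
-- Sketch: one child table per tax_id value/range, pruned via
-- constraint exclusion (requires constraint_exclusion = on).
CREATE TABLE paliasorigin_human (
    CHECK (tax_id = 9606)
) INHERITS (paliasorigin);

CREATE TABLE paliasorigin_other (
    CHECK (tax_id <> 9606)
) INHERITS (paliasorigin);

-- A query with WHERE tax_id = 9606 can then skip paliasorigin_other,
-- since its CHECK constraint contradicts the predicate.
```

Rows must be routed to the right child on insert (typically with a trigger
or rule on the parent), which is part of the maintenance cost to weigh.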

>  - Rebuild the indexes on this table.  Under certain use conditions, btree 
> indexes can get horribly bloated.  Rebuilding the indexes returns them to 
> their most compact and balanced form.  For example: reindex index 
> "paliasorigin_search3_idx";  Apart from the locking and CPU usage during the 
> rebuild, this has no negative consequences, so I'd try this before something 
> drastic like partitioning.  First review the current size of the index for 
> comparison: select 
> pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));

This didn't improve the stats.
> 
> Since you asked specifically about improving the row-count estimate, like the 
> previous responder said, you should consider increasing the statistics 
> target.  This will help if individual columns are being underestimated, but 
> not if the overestimate is due to joint variation.  In other words, the 
> optimizer has no way to tell whether there is a logical relationship between 
> columns A and B such that certain values in B only occur with certain values 
> of A.  Just judging from the names, it sounds like origin_id and tax_id might 
> have a parent-child relationship, so I thought it was worth mentioning.
> 
> Do the columns individually have good estimates?
Yes.
> explain analyze select * from paliasorigin where origin_id=20;
> explain analyze select * from paliasorigin where tax_id=9606;
> 
> If not, increase the statistics on that column, reanalyze the table, and 
> recheck the selectivity estimate:
> alter table paliasorigin alter column origin_id set statistics 20;
> analyze paliasorigin;
> explain analyze select * from paliasorigin where origin_id=20;

My default_statistics_target is set to 1000, and I did set some
column-specific statistics, but that didn't help in this case.

Thanks a lot.

-Kiran


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
