John A Meinel wrote:

Dirk Lutzebaeck wrote:

Greg Stark wrote:

I gave a bunch of "explain analyze select" commands to test estimates for
individual columns. What results do they come up with? If those are inaccurate
then raising the statistics target is a good route. If those are accurate
individually but the combination is inaccurate, then you have a more difficult problem.

After setting the statistics target to 200, the estimates got slightly better but were still inaccurate. The results were attached to my last post. Here is a copy:

It does seem that setting the statistics target to a higher value would help, since rc=130170467 seems to account for almost 1/3 of the data. You probably have other values that are much less common, so setting a high statistics target would help the planner realize that this value occurs at a different frequency from the others. Can you try other numbers and see what the counts are?
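For reference, raising the per-column target and refreshing the histogram would look roughly like this (table and column names taken from this thread; 500 is just an example value):

```sql
-- Raise the per-column statistics target for the two filtered columns
ALTER TABLE bi ALTER COLUMN rc SET STATISTICS 500;
ALTER TABLE bi ALTER COLUMN co SET STATISTICS 500;

-- Rebuild the statistics so the planner sees the new histogram
ANALYZE bi;

-- Then re-check the estimate against the actual row count
EXPLAIN ANALYZE SELECT * FROM bi WHERE rc = 130170467::oid;
```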

Increasing the statistics target much higher has little effect. I guess this is because rc=130170467 makes up such a large portion of the column's distribution.

I assume you did do a vacuum analyze after adjusting the statistics target.


Also interesting that in the time it took you to run these queries, you had received 26 new rows.

Yes, it's a live system...

And finally, what is the row count if you do
explain analyze select * from bi where rc=130170467::oid and co=117305223::oid;

explain analyze select * from bi where rc=130170467::oid and co=117305223::oid;
Seq Scan on bi (cost=0.00..43866.19 rows=105544 width=51) (actual time=0.402..3724.222 rows=513732 loops=1)
Filter: ((rc = 130170467::oid) AND (co = 117305223::oid))

Well, each column's matching rows take up about 1/4 of the whole table, so the data isn't very selective. That's why it needs to do full scans...

If this is a lot less than, say, 500k, then raising the target probably isn't going to help much. The PostgreSQL statistics engine doesn't generate cross-column statistics; it always assumes the columns are independently (randomly) distributed. So if two columns are correlated (or anti-correlated), it won't realize that.
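To see why correlation throws the estimate off, here is a small sketch of the planner's independence assumption, using the actual match counts from the EXPLAIN output below and a hypothetical table size of 2 million rows (the real total isn't shown in this thread):

```python
# Sketch of the planner's independence assumption (hypothetical table size).
# The planner multiplies the per-column selectivities; when the columns are
# correlated, the joint estimate comes out far too low.

total_rows = 2_000_000   # hypothetical table size (not given in the thread)
rows_rc = 513_724        # actual rows with rc = 130170467 (from EXPLAIN)
rows_co = 945_487        # actual rows with co = 117305223 (from EXPLAIN)
rows_both = 513_732      # actual rows matching both predicates

sel_rc = rows_rc / total_rows
sel_co = rows_co / total_rows

# Independence assumption: joint selectivity = product of the two
independent_estimate = round(total_rows * sel_rc * sel_co)

print(independent_estimate)  # far below the ~513k rows actually returned
print(rows_both)
```

With these numbers the independent estimate is less than half the actual count, because nearly every row matching rc also matches co; the planner has no way to know that.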

105k, so that seems to be my problem. Not much randomly distributed data. Does 8.0 address this problem?

Even so, your original desire was to reduce the size of the intermediate step (where you have 700k rows). So you need to try and design a subselect on bi that is as restrictive as possible, so that you don't pull in all of these rows. With any luck, the planner will realize ahead of time that there won't be many rows and can use indexes, etc. But even if it doesn't use an index scan, a query that touches fewer rows won't need a lot of temp disk space.
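A sketch of that idea, where the `created_at` predicate is a hypothetical stand-in for whatever extra restriction your schema actually offers:

```sql
-- Restrict bi as early as possible so the intermediate set stays small.
-- "created_at > ..." is a hypothetical extra predicate; substitute a
-- condition that fits your data.
SELECT bi_small.*
FROM (
    SELECT *
    FROM bi
    WHERE rc = 130170467::oid
      AND co = 117305223::oid
      AND created_at > now() - interval '7 days'
) AS bi_small
-- ... joins against the rest of the original query go here ...
;
```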

I'll try that. What I have already noticed is that one of my output columns is quite large, which is why it uses so much temp space. I'll probably need to sort without that output column and read it in afterwards using a subselect on the limited result.
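That approach might look roughly like this, with `id` standing in for the primary key and `big_col` for the large output column (both hypothetical names):

```sql
-- Sort/limit on the narrow key only, then fetch the wide column for just
-- the surviving rows. "id" and "big_col" are placeholder names.
SELECT b.id, b.big_col
FROM bi AS b
JOIN (
    SELECT id
    FROM bi
    WHERE rc = 130170467::oid
      AND co = 117305223::oid
    ORDER BY id
    LIMIT 1000
) AS narrow ON narrow.id = b.id;
```

The inner query sorts only the key column, so the sort's temp space no longer includes the wide column; it is fetched only for the rows that survive the LIMIT.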

Thanks for your help,


John =:->

explain analyze select * from bi where rc=130170467;

Seq Scan on bi (cost=0.00..41078.76 rows=190960 width=53) (actual time=0.157..3066.028 rows=513724 loops=1)
Filter: (rc = 130170467::oid)
Total runtime: 4208.663 ms
(3 rows)

explain analyze select * from bi where co=117305223;

Seq Scan on bi (cost=0.00..41078.76 rows=603988 width=53) (actual time=0.021..3692.238 rows=945487 loops=1)
Filter: (co = 117305223::oid)
Total runtime: 5786.268 ms
(3 rows)

Here is the distribution of the data in bi:
select count(*) from bi;


select count(*) from bi where rc=130170467::oid;


select count(*) from bi where co=117305223::oid;

