Re: [HACKERS] raising the default default_statistics_target

2004-03-09 Thread Robert Treat
I think the thread you're thinking of is on or about this post:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php

Manfred Koizar produced a patch that modified index correlation by
sorting equal key values based on item pointers. The patch went as far
as getting accepted into the patch queue, but Tom raised some doubts
about it and it was subsequently removed. 

Robert Treat

On Mon, 2004-03-08 at 14:41, Josh Berkus wrote:
 Tom,
 
  Are you sure you're not thinking of stats for functional indexes?
 
 Positive.  I even remember seeing that the patch was accepted.
 
 The patch specifically had to do with a multi-column correlation algorithm for 
 improving the selectivity of multi-column indexes.
 
 Problem is, with 1400 posts per month August to October, I can't find it, and 
 the keywords that I think are obvious don't turn anything up.
 
 -- 
 -Josh Berkus
  Aglio Database Solutions
  San Francisco
 
 
 ---(end of broadcast)---
 TIP 2: you can get off all lists at once with the unregister command
 (send unregister YourEmailAddressHere to [EMAIL PROTECTED])

-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




Re: [HACKERS] raising the default default_statistics_target

2004-03-09 Thread Tom Lane
Robert Treat [EMAIL PROTECTED] writes:
 I think the thread you're thinking of is on or about this post:
 http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php
 Manfred Koizar produced a patch that modified index correlation by
 sorting equal key values based on item pointers. The patch went as far
 as getting accepted into the patch queue, but Tom raised some doubts
 about it and it was subsequently removed. 

Hm, that had nothing to do with multi-column correlation though.
I'm at a loss to think of any work that matches with Josh's
recollection.

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-09 Thread Josh Berkus
Guys,

 Hm, that had nothing to do with multi-column correlation though.
 I'm at a loss to think of any work that matches with Josh's
 recollection.

Hmmm, it's possible that early e-mails about Manfred's patch claimed to 
improve performance for multi-column indexes.

But it's also possible I'm remembering something else.

Darn it, though!  'cause multi-column correlation is one of our big issues on 
estimates for complex queries.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [HACKERS] raising the default default_statistics_target

2004-03-09 Thread scott.marlowe
On Sun, 7 Mar 2004, Tom Lane wrote:

 Neil Conway [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  This is something we need to consider, but we'll need more evidence
  before making a choice.  One thing that we have very little data about
  is how much difference it makes in the quality of planner choices.
 
  Right, but is there a practical way to actually get this data?
 
 I haven't thought of one yet, but perhaps someone will have an idea.

Hi Tom.  I ran some very simple tests on analyze times and query plan 
times on a very simple table, with data randomly distributed.  The index 
was on a date field, since that's what I was testing last.

This was all done on my 512Meg memory 1.1GHz celeron workstation with an 
IDE drive.  I'd love more input on better testing methodologies here...

with 100k or 1M rows that look kinda like this:  (I'll test 10M rows 
later, which means the dataset won't fit in memory, so there'll be lots of 
access going on.  Right now the 1M row table is 80 meg)

 select * from test2 limit 5;
info | dt  |   id
-+-+-
 Francize perfectible swirling fluctuates| 2004-05-20 20:12:04 | 2721995
 Fields chauffeur attentionality grandmother | 2004-04-07 14:36:02 | 2721996
 Belgium bilked explosively defendant| 2004-09-16 16:27:22 | 2721997
 perspectives Buenos Pollux discriminates| 2004-11-11 12:28:31 | 2721998
 Victorianize Savonarola blackmails sufficed | 2004-02-27 21:17:20 | 2721999
(5 rows)

here's what I get with different statistics targets for analyze times:

        100k        1M          1M
        analyze     analyze     plan
target  ms          ms          ms
10      250         875         2
20      350         1250
30      430         1500
40      520         1725
50      580         1900
60      690         2100
70      775         2175
80      850         2300
90      950         2400
100     1000        2600        2.5
200     1806        3700
300     2600        4800
400     2600        5900
500     2600        7200
700     2600        9500
1000    2600        13000       5

Since this data is randomly distributed, I didn't bother doing a lot of 
testing to see how accurate each target setting was.  If that would be 
useful to know I'd gladly test it, but I was only setting out to test the 
time to analyze and the time to plan.

Note that I only tested 3 targets for planning time, as it didn't seem to 
make a very big difference.  The query was:

select * from test2 where dt between 'march 11, 2004' and 'march 13, 2004';

I also ran some quick tests on smaller tables (1000 and 10k rows) and 
there, the plateau that we see in the 100k analyze shows up much quicker, 
at something like 50 or so.  I.e. the analyze time flattened out quickly 
and higher numbers cost very little if anything.

Since this query was quite an easy plan, I'd expect to need a much more 
complex one to test the increase in planning time, say something that has 
to look at a lot of statistics.  Any particular join type or something 
that's likely to do that?





Re: [HACKERS] raising the default default_statistics_target

2004-03-09 Thread Tom Lane
scott.marlowe [EMAIL PROTECTED] writes:
 Hi Tom.  I ran some very simple tests on analyze times and query plan 
 times on a very simple table, with data randomly distributed.  The index 
 was on a date field, since that's what I was testing last.

Thanks.

 I also ran some quick tests on smaller tables (1000 and 10k rows) and 
 there, the plateau that we see in the 100k analyze shows up much quicker, 
 at something like 50 or so.  I.e. the analyze time flattened out quickly 
 and higher numbers cost very little if anything.

The sample size is (IIRC) 300 times stats_target rows, so the plateau
that you're seeing occurs when the sample size becomes the entire table.
It would be useful to note how large the ANALYZE process got to be during
these runs.
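Tom's 300-rows-per-target sampling rule, and the plateau Scott observed, can be
sketched numerically. This is a minimal illustration of the arithmetic from the
discussion above, not PostgreSQL's actual ANALYZE code; the function name is
made up for the example.

```python
# Sketch of Tom's point: ANALYZE samples roughly 300 * stats_target rows,
# so analyze cost stops growing once the sample covers the whole table.
# (The 300x factor is taken from the list discussion; this is a toy model,
# not the real ANALYZE implementation.)

def analyze_sample_rows(stats_target, table_rows):
    """Rows ANALYZE would sample: min(300 * target, whole table)."""
    return min(300 * stats_target, table_rows)

table_rows = 100_000
for target in (10, 100, 300, 500, 1000):
    sample = analyze_sample_rows(target, table_rows)
    capped = " (entire table)" if sample == table_rows else ""
    print(f"target={target:5d}  sample={sample:7d}{capped}")
```

With a 100k-row table the sample hits the whole table a little above target
300, which lines up with the 100k analyze column flattening out around 2600 ms.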

 Since this query was quite an easy plan, I'd expect to need a much more 
 complex one to test the increase in planning time, say something that has 
 to look at a lot of statistics.  Any particular join type or something 
 that's likely to do that?

I'd say try a join on any reasonably plausible foreign-key relationship
(unique key on one side, not-unique data on the other).  That's probably
the most common situation.  As for making it complicated, just stack up
a bunch of such joins ...

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Josh Berkus
Neil,

 In the simple test I performed, raising the default_statistics_target 
 from 10 to 25 resulted in a 40% increase in the time to ANALYZE a 
 large table. (I picked 25 more or less at random -- would 15 or 20 be 
 better?)

I find that very interesting, since I haven't found much higher increases to 
be a proportionate penalty.   For example, on an 11-column table raising 3 
columns to statistics=250 merely doubled the ANALYZE time.  I have not done 
exact timing, but would be happy to do some.

 It may also be the case that for those people for whom 10 
 is an insufficient stats target, 25 is also insufficient.

It is.   I've found that problem queries, especially those caused by real, 
uneven distribution of data, require raising statistics to 150-400 in order 
to fix.  This is much too high a level to assign as a default.

 Any comments on whether increasing the default stats target is a good 
 idea for 7.5? (Details on the test I performed are included below)

No.   I don't think it's a good idea to raise the default for *all* columns; 
for one thing, I'd really hate to think what, say, a default stats of 100 
would do to a TEXT column with an average of 8K of data per row.

Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
(unless something's gone off with that?), which should help a lot of problem 
queries and shift our whole emphasis away from brute-forcing ANALYZE by 
raising stats into the 100's.   

If you really want to tackle this issue, though, here's what I suggest:

1) add a GUC called default_statistics_indexed, which starts at say 100 or 50.
2) Whenever the user indexes a column, automatically increase its stats
target to the level in default_statistics_indexed, if it is currently at the 
level in default_statistics_target.

This will then give indexed columns automatically a somewhat higher level of 
stats analysis than other columns.   This should help a lot of slow query 
problems, yet effectively leave the selection of important columns in the 
hands of the DBA.   Make sense?
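Until something like the proposed default_statistics_indexed GUC exists, a DBA
can apply Josh's idea by hand with ALTER TABLE ... SET STATISTICS, which is real
PostgreSQL syntax. A small sketch that generates the needed statements (the
table and column names here are just illustrative, borrowed from Scott's test
table):

```python
# Hand-rolled version of Josh's idea: give indexed columns a higher stats
# target than the default. ALTER TABLE ... ALTER COLUMN ... SET STATISTICS
# is real PostgreSQL DDL; the table/column names are hypothetical examples.

DEFAULT_TARGET = 10      # default_statistics_target
INDEXED_TARGET = 100     # Josh's proposed default_statistics_indexed

def stats_ddl(table, indexed_columns, target=INDEXED_TARGET):
    """Generate ALTER TABLE statements raising stats on indexed columns."""
    return [
        f"ALTER TABLE {table} ALTER COLUMN {col} SET STATISTICS {target};"
        for col in indexed_columns
    ]

for stmt in stats_ddl("test2", ["dt", "id"]):
    print(stmt)
```

Run the generated statements, then re-ANALYZE the table so the larger sample
takes effect.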

Also, another great feature in this department would be to extend the 
multi-column correlation statistics to cover foreign keys, as a way of 
improving cross-table estimates.

Anyway, keep me in the loop on this; I have a lot of very complex databases I 
can test such issues on.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 It is.   I've found that problem queries, especially those caused by real, 
 uneven distribution of data, require raising statistics to 150-400 in order 
 to fix.  This is much too high a level to assign as a default.

That's basically what's bothering me about the suggestion to increase to
25 --- I'm dubious that it will do any good.

 Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
 (unless something's gone off with that?)

News to me.  It's certainly not there now.

 This will then give indexed columns automatically a somewhat higher
 level of stats analysis than other columns.

That is potentially a good idea.  There's still the question of what is
a reasonable default, though.

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Bruce Momjian
Tom Lane wrote:
 Josh Berkus [EMAIL PROTECTED] writes:
  It is.   I've found that problem queries, especially those caused by real, 
  uneven distribution of data, require raising statistics to 150-400 in order 
  to fix.  This is much too high a level to assign as a default.
 
 That's basically what's bothering me about the suggestion to increase to
 25 --- I'm dubious that it will do any good.
 
  Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
  (unless something's gone off with that?)
 
 News to me.  It's certainly not there now.
 
  This will then give indexed columns automatically a somewhat higher
  level of stats analysis than other columns.
 
 That is potentially a good idea.  There's still the question of what is
 a reasonable default, though.

Do all the columns have to have the same number of statistics buckets? 
Could the stats collector adjust the number of buckets based on the
data somehow?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Do all the columns have to have the same number of statistics buckets? 

They do not, but the effort spent by ANALYZE is proportional to the
largest stats target among all the columns of the table.

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Josh Berkus
Tom,

 Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
 (unless something's gone off with that?)

This was discussed on Hackers in October, a complete implementation was shown, 
I thought it was committed at that time.   If not, what happened to it?

Dammit, it's impossible to find anything in the archives if you don't have 
some good keywords or at least the author.  Is the author reading this?   Will 
you speak up?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Do all the columns have to have the same number of statistics buckets? 
 
 They do not, but the effort spent by ANALYZE is proportional to the
 largest stats target among all the columns of the table.

Could we use previous stats to determine how many buckets to use when
running ANALYZE?  Also, if columns have a different number of buckets,
does that mean that we don't have the same per-query overhead for a
larger stats target?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Further, in 7.5 we'll be introducing correlated stats for multi-column 
 indexes (unless something's gone off with that?)

 This was discussed on Hackers in October, a complete implementation was shown, 
 I thought it was committed at that time.   If not, what happened to it?

Are you sure you're not thinking of stats for functional indexes?

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-08 Thread Josh Berkus
Tom,

 Are you sure you're not thinking of stats for functional indexes?

Positive.  I even remember seeing that the patch was accepted.

The patch specifically had to do with a multi-column correlation algorithm for 
improving the selectivity of multi-column indexes.

Problem is, with 1400 posts per month August to October, I can't find it, and 
the keywords that I think are obvious don't turn anything up.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [HACKERS] raising the default default_statistics_target

2004-03-07 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 Any comments on whether increasing the default stats target is a good 
 idea for 7.5? (Details on the test I performed are included below)

This is something we need to consider, but we'll need more evidence
before making a choice.  One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
(There's no point in slowing down ANALYZE unless the plans get better.)

Also, I would expect that larger stats targets would slow down the parts
of the planner that look at the stats, since there are more data values
to examine.  I do not have any numbers about this cost though --- do you
want to try to get some?

regards, tom lane



Re: [HACKERS] raising the default default_statistics_target

2004-03-07 Thread Neil Conway
Tom Lane wrote:
 This is something we need to consider, but we'll need more evidence
 before making a choice.  One thing that we have very little data about
 is how much difference it makes in the quality of planner choices.
Right, but is there a practical way to actually get this data?

If the distribution of data in the table is irregular, a higher stats 
target is needed to allow good planning choices. Therefore, the effect 
that the stats target has on planner choices depends on the regularity 
of the distribution of data at installations, and there is no way to 
know that in general AFAICS.
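Neil's point, that the payoff from a larger stats target depends on how skewed
the data is, can be illustrated with a toy model. This is a deliberately
simplified stand-in for the "most common values" list that ANALYZE keeps, not
PostgreSQL's actual statistics code, and the distributions are synthetic:

```python
# Toy illustration of Neil's point: how well a fixed-size "most common
# values" list covers a column depends entirely on the data's skew.
# This is a simplified model, not PostgreSQL's actual statistics code.
import random
from collections import Counter

random.seed(42)

# Uniform: 10k distinct values, no value dominates.
uniform = [random.randrange(10_000) for _ in range(100_000)]
# Skewed: heavy-tailed Pareto draw, a few values dominate.
skewed = [int(random.paretovariate(1.2)) for _ in range(100_000)]

def mcv_coverage(values, n_mcv):
    """Fraction of rows accounted for by the n_mcv most common values."""
    top = Counter(values).most_common(n_mcv)
    return sum(count for _, count in top) / len(values)

for target in (10, 25, 100):
    print(f"target={target:4d}"
          f"  uniform={mcv_coverage(uniform, target):.3f}"
          f"  skewed={mcv_coverage(skewed, target):.3f}")
```

On the skewed column a small MCV list already covers most rows, while on the
uniform column even a large list covers almost nothing, which is why no single
default target suits every installation.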

 Also, I would expect that larger stats targets would slow down the parts
 of the planner that look at the stats, since there are more data values
 to examine.  I do not have any numbers about this cost though --- do you
 want to try to get some?
Given the magnitude of the change (25 data elements versus 10), I 
wouldn't expect this to produce a major change in the total runtime of 
the optimizer. However, I don't know the optimizer that well, so I'll 
do some benchmarks when I get a chance.

-Neil



Re: [HACKERS] raising the default default_statistics_target

2004-03-07 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 This is something we need to consider, but we'll need more evidence
 before making a choice.  One thing that we have very little data about
 is how much difference it makes in the quality of planner choices.

 Right, but is there a practical way to actually get this data?

I haven't thought of one yet, but perhaps someone will have an idea.


 Also, I would expect that larger stats targets would slow down the parts
 of the planner that look at the stats, since there are more data values
 to examine.  I do not have any numbers about this cost though --- do you
 want to try to get some?

 Given the magnitude of the change (25 data elements versus 10), I 
 wouldn't expect this to produce a major change in the total runtime of 
 the optimizer.

I wouldn't either, but if we need to raise the stats target to 100 or
1000 to make a meaningful difference, then the question becomes more
urgent.

regards, tom lane
