show default_statistics_target; --> 500
ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;

Here is the output of "ANALYZE VERBOSE bigtable;":
INFO: analyzing "public.bigtable" inheritance tree
INFO: "bigtable_y2018": scanned 622250 of 10661013 pages, containing 11994670 
live rows and 5091 dead rows; 622250 rows in sample, 205504753 estimated total 
rows
INFO: "bigtable_y2019": scanned 520159 of 8911886 pages, containing 10017582 
live rows and 6148 dead rows; 520159 rows in sample, 171631268 estimated total 
rows
INFO: "bigtable_y2020": scanned 357591 of 6126616 pages, containing 7031238 
live rows and 1534 dead rows; 357591 rows in sample, 120466385 estimated total 
rows
INFO: analyzing "public.bigtable_y2018"
INFO: "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115 
live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated 
total rows
INFO: analyzing "public.bigtable_y2019"
INFO: "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514 
live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated 
total rows
INFO: analyzing "public.bigtable_y2020"
INFO: "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967 
live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total 
rows
INFO: analyzing "public.bigtable_y2021"
INFO: "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0 dead 
rows; 8 rows in sample, 8 estimated total rows
ANALYZE
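
If I read the sampling correctly, ANALYZE takes 300 rows per unit of statistics
target, which matches the sample sizes above:

300 * 5000 = 1,500,000 rows sampled per partition
622,250 + 520,159 + 357,591 = 1,500,000 rows for the inheritance-tree pass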

On the comment from Adrian:

SELECT
( SELECT SUM(x) FROM UNNEST(most_common_freqs) x ) frac_MCV,
n_distinct,
array_length(most_common_vals,1) n_mcv,
array_length(histogram_bounds,1) n_hist,
tablename
FROM pg_stats
WHERE
schemaname = 'public'
AND attname like 'instrumentid_ref';

frac_MCV   | n_distinct | n_mcv | n_hist | tablename
-----------+------------+-------+--------+----------------
0.9205394  |     122160 |  2140 |   5001 | bigtable
0.9203018  |     124312 |  1736 |   5001 | bigtable_y2018
0.9258158  |     113846 |  2107 |   5001 | bigtable_y2020
0.875      |     -0.375 |     2 |        | bigtable_y2021
0.92304045 |     118267 |  2204 |   5001 | bigtable_y2019

select count(distinct instrumentid_ref) from bigtable; --> 33 385 922

So n_distinct for bigtable's instrumentid_ref is underestimated by roughly 300X
(122,160 estimated vs. 33,385,922 actual), even with the column's statistics
target set to 5000. Pretty weird.
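
One workaround I am considering is to pin n_distinct manually per column (the
"hardcode the statistics" idea from my earlier mail below). Just a sketch, not
tested on this table yet; -0.067 is simply count(distinct) divided by the total
row count from the numbers above (33,385,922 / ~500M):

-- negative values are interpreted as a fraction of the total row count
ALTER TABLE public.bigtable
  ALTER COLUMN instrumentid_ref SET (n_distinct_inherited = -0.067);
ALTER TABLE public.bigtable_y2018
  ALTER COLUMN instrumentid_ref SET (n_distinct = -0.067);
-- ...and likewise for the other partitions; the override is only
-- picked up by the next ANALYZE
ANALYZE bigtable;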

K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 3:07 PM, Klaudie Willis 
<klaudie.wil...@protonmail.com> wrote:

> I didn't run it with "verbose", but otherwise yes, several times. I can do
> it again with VERBOSE if you are interested in the output. Just give me some
> time. 500M rows and 50 columns is no small job :)
>
> K
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, June 23, 2020 2:51 PM, Ron <ronljohnso...@gmail.com> wrote:
>
>> Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?
>>
>> On 6/23/20 7:42 AM, Klaudie Willis wrote:
>>
>>> Friends,
>>>
>>> I run PostgreSQL 12.3 on Windows. I have just discovered a pretty
>>> significant problem with PostgreSQL and my data. I have a large table, 500M
>>> rows and 50 columns, split into 3 partitions by year. In addition to the
>>> primary key, one of the columns is indexed, and I do lookups on it.
>>>
>>> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
>>> limit 1000
>>>
>>> It responded well, in sub-second time, and it used the index on the
>>> column. However, when I changed it to:
>>>
>>> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
>>> limit 10000 -- (notice 10K now)
>>>
>>> The planner decided to do a full table scan of the entire 500M row table,
>>> and that did not work very well. At first I had no clue why, and when I
>>> disabled sequential scan the query returned immediately. But I should not
>>> have to do that.
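>>>
>>> (By "disabled sequential scan" I mean the usual session toggle, roughly:
>>>
>>> SET enable_seqscan = off;
>>> Select * from bigtable b where b.instrument_ref in (x,y,z,...) limit 10000;
>>> RESET enable_seqscan;
>>>
>>> which is fine as a test, but not something I want to rely on.)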
>>>
>>> I got my first hint of why this happens when I looked at the statistics.
>>> For the column in question, "instrument_ref", the statistics claimed the
>>> following:
>>>
>>> The default_statistics_target=500, and ANALYZE has been run.
>>> select n_distinct from pg_stats where attname like 'instr%_ref';
>>> -- Result: ~40,000
>>> select count(distinct instrumentid_ref) from bigtable;
>>> -- Result: 33 385 922 (!!)
>>>
>>> That is an astonishing difference of almost 1000X.
>>>
>>> When the planner thinks there are only 40K distinct values, it makes sense
>>> to switch to a table scan in order to fill the LIMIT of 10,000. But the
>>> estimate is wrong, very wrong, and the query takes hundreds of seconds
>>> instead of a few.
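>>>
>>> Roughly: with ~500M rows and n_distinct around 40K, the planner assumes on
>>> the order of 500,000,000 / 40,000 = 12,500 rows per value, so even a short
>>> IN list is expected to produce far more than the 10,000 rows needed for the
>>> LIMIT, and a sequential scan that stops early looks cheap.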
>>>
>>> I have tried increasing the statistics target to 5000, and it helps, but
>>> it only reduces the error to about 100X. Still crazy high.
>>>
>>> I understand that this is a known problem and I have read previous posts
>>> about it, but I have never seen anyone report such a large discrepancy.
>>>
>>> I have considered these fixes:
>>> - hardcode the statistics to a particular ratio of the total number of rows
>>> - randomize the rows more, so that they do not suffer from page clustering.
>>> However, this probably has other implications.
>>>
>>> Feel free to comment :)
>>>
>>> K
>>
>> --
>> Angular momentum makes the world go 'round.
