I didn't run it with "verbose", but otherwise, yes, several times. I can do it 
again with verbose if you are interested in the output. Just give me some time; 
500M rows by 50 columns is no small job :)

K

------- Original Message -------
On Tuesday, June 23, 2020 2:51 PM, Ron <ronljohnso...@gmail.com> wrote:

> Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?
>
> On 6/23/20 7:42 AM, Klaudie Willis wrote:
>
>> Friends,
>>
>> I run PostgreSQL 12.3 on Windows. I have just discovered a pretty 
>> significant problem with PostgreSQL and my data. I have a large table, 500M 
>> rows, 50 columns. It is split into 3 partitions by year. In addition to the 
>> primary key, one of the columns is indexed, and I do lookups on it.
>>
>> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
>> limit 1000
>>
>> It responded well, sub-second, and it used the index on that column. 
>> However, when I changed it to:
>>
>> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
>> limit 10000 -- (notice 10K now)
>>
>> The planner decided to do a full table scan on the entire 500M row table! 
>> And that did not work very well. At first I had no clue why it did so, but 
>> when I disabled sequential scans the query returned immediately. I should 
>> not have to do that, though.
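>>
>> (For reference, turning sequential scans off for a single session is just 
>> a diagnostic knob, not a fix; it looks something like this:)
>>
>> set enable_seqscan = off;
>> select * from bigtable b where b.instrument_ref in (x,y,z,...) limit 10000;
>> reset enable_seqscan;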
>>
>> I got my first hint of why this problem occurs when I looked at the 
>> statistics. For the column in question, "instrument_ref" the statistics 
>> claimed it to be:
>>
>> The default_statistics_target=500, and analyze has been run.
>> select * from pg_stats where attname like 'instr%_ref'; -- n_distinct: 40,000
>> select count(distinct instrumentid_ref) from bigtable; -- Result: 33,385,922 
>> (!!)
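>>
>> To be precise, the statistic in question is the n_distinct column of 
>> pg_stats (note that a negative value there would mean a fraction of the 
>> row count rather than an absolute count):
>>
>> select tablename, attname, n_distinct
>> from pg_stats
>> where tablename = 'bigtable' and attname like 'instr%_ref';
>> -- n_distinct here is ~40,000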
>>
>> That is an astonishing difference of almost 1000X (33,385,922 / 40,000 ≈ 835).
>>
>> When the planner thinks there are only 40K distinct values, it makes sense 
>> to switch to a table scan in order to fill the LIMIT of 10,000. But the 
>> estimate is wrong, very wrong, and the query takes hundreds of seconds 
>> instead of a few.
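>>
>> The misestimate should be visible directly in the plan; something like 
>> this shows the row count the planner is working from:
>>
>> explain select * from bigtable b
>>   where b.instrument_ref in (x,y,z,...) limit 10000;
>> -- with n_distinct = 40K the planner assumes ~500M/40K ≈ 12,500 matching
>> -- rows per value; the real figure is ~500M/33.4M ≈ 15, which is why it
>> -- thinks a seq scan will satisfy the LIMIT quickly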
>>
>> I have tried to increase the statistics target to 5000, and it helps, but 
>> it only reduces the error to 100X. Still crazy high.
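>>
>> (For reference, raising the target on just this column, rather than 
>> globally, would look something like this:)
>>
>> alter table bigtable alter column instrumentid_ref set statistics 5000;
>> analyze bigtable; -- re-sample with the larger target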
>>
>> I understand that this is a known problem, and I have read previous posts 
>> about it, yet I have never seen anyone report such a large error factor.
>>
>> I have considered these fixes:
>> - hardcode the statistics to a particular ratio of the total number of 
>> rows (see the sketch after this list)
>> - randomize the rows more, so that the sample does not suffer from page 
>> clustering; however, this probably has other implications
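>>
>> For the first option, I believe the per-column override is something like 
>> the following (-0.07 is only illustrative: roughly 33.4M distinct values 
>> in 500M rows; on the partitioned parent, n_distinct_inherited may be the 
>> option that matters):
>>
>> alter table bigtable alter column instrumentid_ref
>>   set (n_distinct = -0.07); -- negative means a fraction of the row count
>> analyze bigtable; -- the override only takes effect at the next analyze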
>>
>> Feel free to comment :)
>>
>> K
>
> --
> Angular momentum makes the world go 'round.
