On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>
>>> Hi,
>>>
>>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>>
>>>> When reviewing some recent patches, I decided the statistics gathered
>>>> for arrays had some pre-existing shortcomings.
>>>>
>>>> The main one is that when the arrays contain rare elements there is
>>>> no histogram to fall back upon when the MCE array is empty, the way
>>>> there is for scalar stats. So it has to punt completely and resort
>>>> to saying that it is 0.5% selectivity, without recourse to any data
>>>> at all.
>>>>
>>>> The rationale for applying the threshold before things are eligible
>>>> for inclusion in the MCE array seems to be that this puts some
>>>> theoretical bound on the amount of error we are likely to have for
>>>> that element. But I think it is better to exceed that theoretical
>>>> bound than it is to have no data at all.
>>>>
>>>> The attached patch forces there to be at least one element in MCE,
>>>> keeping the one element with the highest predicted frequency if the
>>>> MCE would otherwise be empty. Then any other element queried for is
>>>> assumed to be no more common than this most common element.
>>>
>>> We only really need the frequency, right? So do we really need to keep
>>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>>> same number of values as most_common_elems anyway:
>>>
>>>     A list of the frequencies of the most common element values, i.e.,
>>>     the fraction of rows containing at least one instance of the given
>>>     value. Two or three additional values follow the per-element
>>>     frequencies; these are the minimum and maximum of the preceding
>>>     per-element frequencies, and optionally the frequency of null
>>>     elements. (Null when most_common_elems is.)
>>>
>>> So we might modify it so that it's always defined - either it tracks
>>> the same values as today (when most_common_elems is defined), or the
>>> frequency of the most common element (when most_common_elems is NULL).
>>
>> I had also considered that. It requires more changes to make it happen,
>> and it seems to create a more complex contract on what those columns
>> mean, but without giving a corresponding benefit.
>>
>>> This way we can keep the current theoretical error-bound on the MCE
>>> frequencies, and if that's not possible we can at least have the new
>>> value without confusing existing code.
>>
>> But if the frequency of the most common element was grossly wrong, then
>> whatever value we stick in there is still going to be grossly wrong.
>> Removing the value associated with it isn't going to stop it from being
>> wrong. When we query with the (incorrectly thought) first most common
>> element, either it will find and use the wrong value from slot 1, or it
>> will find nothing and fall back on the same wrong value from slot 3.
>
> Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
> collect no MCEs. Moreover, I think we should store it even when
> num_mcelem >= track_len and we haven't cut any of the MCEs we found. In
> that case we can get a more precise estimate for a rare element using the
> knowledge that all MCEs which exceed the threshold are present (while
> their frequencies could be much higher than the threshold).
>
> When there are no MCEs, then we should use the assumption that there are
> no elements more frequent than cutoff_freq / nonnull_cnt. Using lower
> values wouldn't be statistically correct.

The patch implementing my idea above is attached. In your example it gives
the following result:

# explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate. However, I
believe it would be a safer estimate in general.
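To make the arithmetic behind the proposal concrete, here is a small
standalone sketch. The names (element_no, nonnull_cnt, bucket_width,
cutoff_freq, minfreq) follow compute_array_stats() in array_typanalyze.c,
but the sample numbers are invented and the code is only an illustration of
the idea, not the patch itself:

#include <stdio.h>

int
main(void)
{
    /* Invented sample numbers; in ANALYZE these come from the scan. */
    long    element_no = 100000;    /* non-null elements examined */
    long    nonnull_cnt = 10000;    /* non-null array values examined */
    int     bucket_width = 14300;   /* Lossy Counting bucket width (example) */

    /*
     * Lossy Counting cutoff used by compute_array_stats(): an element
     * becomes an MCE only if its count exceeds this, which bounds the
     * counting error of everything that is kept.
     */
    long    cutoff_freq = 9 * element_no / bucket_width;

    /*
     * The proposal: when no element reaches the cutoff (and also when the
     * tracking list was not trimmed, i.e. num_mcelem >= track_len), store
     * cutoff_freq / nonnull_cnt as the minfreq extra cell.  An element
     * absent from the MCE list is then assumed to be no more frequent
     * than the cutoff -- a bound the algorithm actually guarantees,
     * unlike the current hard-wired 0.5%.
     */
    double  minfreq = (double) cutoff_freq / (double) nonnull_cnt;

    printf("fallback frequency for unlisted elements: %g\n", minfreq);
    return 0;
}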
I've faced difficulties with storing an empty mcelem array:
update_attstats turns an empty array into NULL, and get_attstatsslot
throws an error when trying to read a NULL array. I've changed
get_attstatsslot so that it returns an empty array when it meets NULL.
I'm not sure about this solution.
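Roughly, the change looks like this (a simplified excerpt modeled on
get_attstatsslot() in src/backend/utils/cache/lsyscache.c; declarations
and the element-type lookup are condensed here, so treat it as a sketch
rather than the exact hunk from the patch):

    val = SysCacheGetAttr(STATRELATTINH, statstuple,
                          Anum_pg_statistic_stavalues1 + i,
                          &isnull);
    if (isnull)
    {
        /*
         * update_attstats() stores an empty stavalues array as a NULL
         * column, so treat NULL as an empty slot instead of doing
         * elog(ERROR, "stavalues is null") as the current code does.
         */
        *values = NULL;
        *nvalues = 0;
    }
    else
    {
        /* Unchanged path: detoast the array and split it into Datums. */
        statarray = DatumGetArrayTypePCopy(val);
        get_typlenbyvalalign(ARR_ELEMTYPE(statarray),
                             &elmlen, &elmbyval, &elmalign);
        deconstruct_array(statarray,
                          ARR_ELEMTYPE(statarray),
                          elmlen, elmbyval, elmalign,
                          values, NULL, nvalues);
    }

Callers that currently assume a non-NULL values array would then see
*nvalues == 0 instead of an error, which is why I'm unsure whether this
is the right contract.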
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: array_typanalyze_0_mce.patch (binary)