On Wed, Sep 16, 2015 at 8:01 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On Mon, Aug 24, 2015 at 8:26 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Thu, Aug 20, 2015 at 6:00 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>
>>> Hi,
>>>
>>> On 08/11/2015 04:38 PM, Jeff Janes wrote:
>>>
>>>> When reviewing some recent patches, I decided the statistics gathered
>>>> for arrays had some pre-existing shortcomings.
>>>>
>>>> The main one is that when the arrays contain rare elements there is
>>>> no histogram to fall back upon when the MCE array is empty, the way
>>>> there is for scalar stats. So it has to punt completely and resort
>>>> to saying that it is 0.5% selectivity, without recourse to any data
>>>> at all.
>>>>
>>>> The rationale for applying the threshold before things are eligible
>>>> for inclusion in the MCE array seems to be that this puts some
>>>> theoretical bound on the amount of error we are likely to have for
>>>> that element. But I think it is better to exceed that theoretical
>>>> bound than it is to have no data at all.
>>>>
>>>> The attached patch forces there to be at least one element in MCE,
>>>> keeping the one element with the highest predicted frequency if the
>>>> MCE would otherwise be empty. Then any other element queried for is
>>>> assumed to be no more common than this most common element.
>>>
>>> We only really need the frequency, right? So do we really need to keep
>>> the actual MCV element? I.e. most_common_elem_freqs does not have the
>>> same number of values as most_common_elems anyway:
>>>
>>>     A list of the frequencies of the most common element values, i.e.,
>>>     the fraction of rows containing at least one instance of the given
>>>     value. Two or three additional values follow the per-element
>>>     frequencies; these are the minimum and maximum of the preceding
>>>     per-element frequencies, and optionally the frequency of null
>>>     elements. (Null when most_common_elems is.)
>>>
>>> So we might modify it so that it's always defined - either it tracks
>>> the same values as today (when most_common_elems is defined), or the
>>> frequency of the most common element (when most_common_elems is NULL).
>>
>> I had also considered that. It requires more changes to make it happen,
>> and it seems to create a more complex contract on what those columns
>> mean, but without giving a corresponding benefit.
>>
>>> This way we can keep the current theoretical error-bound on the MCE
>>> frequencies, and if that's not possible we can at least have the new
>>> value without confusing existing code.
>>
>> But if the frequency of the most common element was grossly wrong, then
>> whatever value we stick in there is still going to be grossly wrong.
>> Removing the value associated with it isn't going to stop it from being
>> wrong. When we query with the (incorrectly thought) first most common
>> element, either it will find and use the wrong value from slot 1, or it
>> will find nothing and fall back on the same wrong value from slot 3.
>
> Hmm, I think we should store cutoff_freq / nonnull_cnt as minfreq when we
> collect no MCEs. Moreover, I think we should store it even when
> num_mcelem >= track_len and we haven't cut any of the MCEs we found. In
> that case we can get a more precise estimate for a rare element using the
> knowledge that all MCEs which exceed the threshold are present (while
> their frequencies could be much higher than the threshold).
>
> When there are no MCEs, then we should use the assumption that there are
> no elements more frequent than cutoff_freq / nonnull_cnt. Using lower
> values wouldn't be statistically correct.

The patch implementing my idea above is attached. In your example it gives
the following result:

# explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=30 width=61) (actual time=28.691..28.691 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.044 ms
 Execution time: 28.707 ms
(5 rows)

In this particular example it gives a less accurate estimate. However, I
believe it would be a safer estimate in general.
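To make the arithmetic behind the proposal concrete, here is a small
standalone sketch. The names (element_no, nonnull_cnt, bucket_width,
cutoff_freq, minfreq) follow compute_array_stats() in array_typanalyze.c,
but the sample numbers are invented and the code is only an illustration of
the idea, not the patch itself:

#include <stdio.h>

int
main(void)
{
    /* Invented sample numbers; in ANALYZE these come from the scan. */
    long    element_no = 100000;    /* non-null elements examined */
    long    nonnull_cnt = 10000;    /* non-null array values examined */
    int     bucket_width = 14300;   /* Lossy Counting bucket width (example) */

    /*
     * Lossy Counting cutoff used by compute_array_stats(): an element
     * becomes an MCE only if its count exceeds this, which bounds the
     * counting error of everything that is kept.
     */
    long    cutoff_freq = 9 * element_no / bucket_width;

    /*
     * The proposal: when no element reaches the cutoff (and also when the
     * tracking list was not trimmed, i.e. num_mcelem >= track_len), store
     * cutoff_freq / nonnull_cnt as the minfreq extra cell.  An element
     * absent from the MCE list is then assumed to be no more frequent
     * than the cutoff -- a bound the algorithm actually guarantees,
     * unlike the current hard-wired 0.5%.
     */
    double  minfreq = (double) cutoff_freq / (double) nonnull_cnt;

    printf("fallback frequency for unlisted elements: %g\n", minfreq);
    return 0;
}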
I've faced difficulties with storing an empty mcelem array:
update_attstats turns an empty array into NULL, and get_attstatsslot
throws an error when trying to read a NULL array. I've changed
get_attstatsslot so that it returns an empty array when it meets NULL.
I'm not sure about this solution.
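Roughly, the change looks like this (a simplified excerpt modeled on
get_attstatsslot() in src/backend/utils/cache/lsyscache.c; declarations
and the element-type lookup are condensed here, so treat it as a sketch
rather than the exact hunk from the patch):

    val = SysCacheGetAttr(STATRELATTINH, statstuple,
                          Anum_pg_statistic_stavalues1 + i,
                          &isnull);
    if (isnull)
    {
        /*
         * update_attstats() stores an empty stavalues array as a NULL
         * column, so treat NULL as an empty slot instead of doing
         * elog(ERROR, "stavalues is null") as the current code does.
         */
        *values = NULL;
        *nvalues = 0;
    }
    else
    {
        /* Unchanged path: detoast the array and split it into Datums. */
        statarray = DatumGetArrayTypePCopy(val);
        get_typlenbyvalalign(ARR_ELEMTYPE(statarray),
                             &elmlen, &elmbyval, &elmalign);
        deconstruct_array(statarray,
                          ARR_ELEMTYPE(statarray),
                          elmlen, elmbyval, elmalign,
                          values, NULL, nvalues);
    }

Callers that currently assume a non-NULL values array would then see
*nvalues == 0 instead of an error, which is why I'm unsure whether this
is the right contract.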
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: array_typanalyze_0_mce.patch (binary)