# Re: [HACKERS] Collect frequency statistics for arrays

Hi!

Thanks for your fixes to the patch. They look correct to me. I made some
further fixes of my own. Proofs of some of the concepts are still needed;
I'm going to provide them in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch <n...@leadboat.com> wrote:

> > I'm not sure about a shared lossy counting module, because the shared
> > code would be relatively small. The part of the compute_array_stats
> > function that takes care of array decompression, distinct occurrence
> > calculation, the distinct element count histogram, packing statistics
> > slots, etc. is much larger than the lossy counting algorithm itself.
> > Maybe there are other opinions in the community?
>
> True; it would probably increase total lines of code.  The benefit, if any,
> lies in separation of concerns; the business of implementing this algorithm
> is quite different from the other roles of these typanalyze functions.  I
> won't insist that you try it, though.
>
I'd prefer to try it as a separate patch.

> > > > +             /*
> > > > +              * The probability of no occurrence of the events forming
> > > > +              * the "rest" probability tends to exp(-rest) as the number
> > > > +              * of events goes to infinity. A further simplification is
> > > > +              * to replace those events with a single event of
> > > > +              * probability (1 - exp(-rest)).
> > > > +              */
> > > > +             rest = 1.0f - exp(-rest);
> > >
> > > What is the name of the underlying concept in probability theory?
> >
> > The closest concept to the calculated distribution is the multinomial
> > distribution. But it's not exactly the same, because the multinomial
> > distribution gives the probability of a particular count of each event
> > occurrence, not the probability of overall occurrence. Actually, the
> > distribution is calculated just from the assumption that events are
> > independent. The closest concept to the "rest" probability is
> > approximation by an exponential distribution. It's quite a rough
> > approximation, but I can't invent something better with low
> > computational complexity.
>
> Do you have a URL of a tutorial or paper that explains the method in more
> detail?  If, rather, this is a novel synthesis, could you write a proof to
>
Unfortunately I don't have a relevant paper for it. I'll try to find one;
otherwise I'll try to write up a proof.
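As a quick numeric illustration of the approximation discussed above (a sketch, not the patch's code): for many independent rare events with small probabilities p_i, the probability that none of them occurs is the product of (1 - p_i), which approaches exp(-sum p_i); collapsing the "rest" into a single event of probability 1 - exp(-rest) relies on exactly this limit. The event probabilities below are made up for the demonstration.

```python
import math

# 50 rare independent events whose probabilities make up the "rest" mass
rest_probs = [0.01] * 50
rest = sum(rest_probs)                  # total "rest" probability: 0.5

# Exact probability that none of the rest events occurs (independence)
p_none_exact = math.prod(1.0 - p for p in rest_probs)

# The approximation used in the patch comment: exp(-rest)
p_none_approx = math.exp(-rest)

print(p_none_exact)    # ~0.605
print(p_none_approx)   # ~0.607

# Replacing all rest events with one event of probability 1 - exp(-rest)
p_any_approx = 1.0 - p_none_approx
```

The approximation gets better as the individual probabilities shrink and the number of events grows, which is the regime lossy counting leaves the "rest" elements in.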

> > > > + /*
> > > > +  * Array selectivity estimation based on most common element
> > > > +  * statistics for the "column <@ const" case. The assumption that
> > > > +  * element occurrences are independent implies a certain distribution
> > > > +  * of array lengths. Typically the real distribution of lengths
> > > > +  * differs from it significantly. For example, even if we have a set
> > > > +  * of arrays with one integer element in the range [0;10] each,
> > > > +  * element occurrences are not independent. Because in the case of
> > > > +  * independence we
> > >
> > > Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
> > > '{6,19,4}' cannot appear?
> >
> > I refer to a column where only one element exists, i.e. the only possible
> > values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}',
> > '{9}', '{10}'. That is a corner case. But a similar situation occurs when,
> > for example, we have a distribution of distinct element counts between 1
> > and 3. It differs significantly from the distribution under independent
> > occurrence.
>
> Oh, I think I see now.  If each element 1..10 had frequency 0.1
> independently, column values would have exactly one distinct element just
> 39% of the time?
>
Yes, that's right.
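The 39% figure can be checked directly: if each of the 10 elements appears in an array independently with probability 0.1, the number of distinct elements is Binomial(10, 0.1), so the probability of exactly one distinct element is C(10,1) * 0.1 * 0.9^9.

```python
from math import comb

# P(exactly one of 10 independent elements present), each with probability 0.1
p_exactly_one = comb(10, 1) * 0.1 * 0.9 ** 9
print(round(p_exactly_one, 3))  # 0.387, i.e. about 39%
```

This is far from the 100% single-element arrays the column actually contains, which is the mismatch the surrounding discussion is about.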

> If probability theory has a prototypical problem resembling this, it would
> be nice to include a URL to a thorough discussion thereof.  I could not
> think of the search terms to find one, though.
>
Actually, using both the distinct element count histogram and the element
occurrence frequencies produces some probability distribution (which is more
complex than just independent element occurrence). If the real distribution
is close to this distribution, then the estimate is accurate. I haven't met
such distributions in papers; actually, I just invented it in my attempts to
do accurate "column <@ const" estimation at least in simple cases. I'll try
to search for similar things in papers, but I doubt I'll succeed. Otherwise
I'll try to write a more detailed proof.

> > *** /dev/null
>
> > + Selectivity
> > + calc_scalararraysel(VariableStatData *vardata, Datum constval,
> > +                     bool orClause, Oid operator)
> > + {
> > +     Oid                     elemtype;
> > +     Selectivity selec;
> > +     TypeCacheEntry *typentry;
> > +     Datum      *hist;
> > +     int                     nhist;
> > +     FunctionCallInfoData cmpfunc;
> > +
> > +     elemtype = get_base_element_type(vardata->vartype);
> > +
> > +     /* Get default comparison function */
> > +     typentry = lookup_type_cache(elemtype,
> > +                TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
> > +
> > +     /* Handle only the "=" operator. Return default selectivity in other cases. */
> > +     if (operator != typentry->eq_opr)
> > +             return (Selectivity) 0.5;
>
> Punting on other operators this way creates a plan quality regression for
> operations like "const < ANY (column)".  Please do it some way that falls
> back on the somewhat-better existing scalararraysel() treatment for this.
>
I've made calc_scalararraysel return -1 in this case, and also when there is
no comparison function. scalararraysel continues its own work when
calc_scalararraysel returns a negative value.
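The fallback convention can be sketched in miniature (a hedged illustration with hypothetical, heavily simplified signatures, not the actual PostgreSQL C code): the specialized estimator returns a negative sentinel to mean "no estimate here", and the caller falls back to the generic treatment.

```python
DEFAULT_SELECTIVITY = 0.5

def calc_scalararraysel_sketch(op_is_equality: bool, have_cmp_func: bool) -> float:
    """Stand-in for calc_scalararraysel: negative result = 'cannot estimate'."""
    if not op_is_equality or not have_cmp_func:
        return -1.0        # sentinel: caller should use the generic path
    return 0.1             # placeholder for the statistics-based estimate

def scalararraysel_sketch(op_is_equality: bool, have_cmp_func: bool) -> float:
    """Stand-in for scalararraysel: try the specialized path, else fall back."""
    selec = calc_scalararraysel_sketch(op_is_equality, have_cmp_func)
    if selec < 0:
        selec = DEFAULT_SELECTIVITY   # existing generic treatment takes over
    return selec
```

Since selectivities are valid only in [0, 1], a negative return value is an unambiguous in-band signal, avoiding an extra output parameter.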

> > + /*
> > +  * Calculate the first n distinct element count probabilities from the
> > +  * histogram. We assume that any interval between histogram values a and b
> > +  * gives probability 1 / ((b - a + 1) * (nhist - 1)) to the values between
> > +  * a and b, and half of that to a and b themselves. Returns the total
> > +  * probability that the distinct element count is less than or equal to n.
> > +  */
> > + static float
> > + calc_hist(Datum *hist, int nhist, float *hist_part, int n)
>
> To test this function, I ran the following test case:
>
> set default_statistics_target = 4;
> create table t3 as select array(select * from generate_series(1, v)) as arr
> from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
> analyze t3; -- length_histogram_bounds = {2,2,5,5}
> select * from t3 where arr <@ array[6,7,8,9,10,11];
>
> Using gdb to observe calc_hist()'s result during the last command:
>
> (gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
> $23 = 0.666666687
> (gdb) x/6f hist_part
> 0xcd4bc8:       0       0       0.333333343     0
> 0xcd4bd8:       0       0.333333343
>
> I expected an equal, nonzero probability in hist_part[3] and hist_part[4]
> and a total probability of 1.0.
>
> > + {
> > +     int                     k,
> > +                             i = 0,
> > +                             prev_interval = 0,
> > +                             next_interval = 0;
> > +     float           frac,
> > +                             total = 0.0f;
> > +
> > +     /*
> > +      * frac is the probability contribution of each interval between
> > +      * histogram values. We have nhist - 1 intervals, so each one
> > +      * contributes 1 / (nhist - 1).
> > +      */
> > +     frac = 1.0f / ((float) (nhist - 1));
> > +     for (k = 0; k <= n; k++)
> > +     {
> > +             int                     count = 0;
> > +
> > +             /* Count occurrences of k distinct element counts in the histogram. */
> > +             while (i < nhist && DatumGetInt32(hist[i]) <= k)
> > +             {
> > +                     if (DatumGetInt32(hist[i]) == k)
> > +                             count++;
> > +                     i++;
> > +             }
> > +
> > +             if (count > 0)
> > +             {
> > +                     float           val;
> > +
> > +                     /* Find the length between the current histogram value and the next one */
> > +                     if (i < nhist)
> > +                             next_interval = DatumGetInt32(hist[i + 1]) -
>
> Doesn't this read past the array end when i == nhist - 1?

It was a bug, and it also caused the incorrect histogram calculation you
noted above. Fixed.
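The intended distribution can be sketched independently of the C code (a hedged reading of the comment, not the patch itself): each of the nhist - 1 intervals between consecutive histogram bounds carries equal mass frac = 1 / (nhist - 1), spread over the interval's values with the two endpoints weighted half a share each, so that every interval's mass sums to exactly frac and the whole distribution sums to 1. A degenerate interval (a == b) puts all of frac on its single value.

```python
from collections import defaultdict

def distinct_count_probs(hist):
    """Probability of each distinct-element count implied by the histogram."""
    nhist = len(hist)
    frac = 1.0 / (nhist - 1)            # equal mass per interval
    probs = defaultdict(float)
    for a, b in zip(hist, hist[1:]):
        if a == b:
            probs[a] += frac            # degenerate interval: all mass on a
            continue
        share = frac / (b - a)          # interior values get a full share,
        probs[a] += share / 2.0         # the two endpoints half a share each,
        probs[b] += share / 2.0         # so the interval still sums to frac
        for v in range(a + 1, b):
            probs[v] += share
    return dict(probs)

# Noah's example: length_histogram_bounds = {2,2,5,5}
p = distinct_count_probs([2, 2, 5, 5])
# p[3] and p[4] come out equal and nonzero, and the masses sum to 1.0,
# matching the expectation stated for the gdb session above.
```

Under this normalization, calc_hist(..., n=6) for the example histogram should return a cumulative probability of 1.0, which is what the fixed code is expected to produce.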

> > +     stats->extra_data = extra_data->std_extra_data;
> > +     old_context = CurrentMemoryContext;
> > +     extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
> > +     MemoryContextSwitchTo(old_context);
>
> Is the callee known to change CurrentMemoryContext and not restore it?
> Offhand, I'm not seeing how it could do so.
>
Right. Saving the memory context is not needed. Removed.

> > *** a/src/include/catalog/pg_type.h
> > --- b/src/include/catalog/pg_type.h
>
> This now updates all array types except record[].  I don't know offhand how
> to even make a non-empty value of type record[], let alone get it into a
> context where ANALYZE would see it.  However, is there a particular reason
> to make that one different?
>
Oh, I didn't manage to update all array types even in two tries :) Fixed.

------
With best regards,
Alexander Korotkov.

arrayanalyze-0.11.patch.gz
Description: GNU Zip compressed data