On 1/18/21 4:48 PM, Dean Rasheed wrote:
Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().


Thanks, I'll fix all of that.


A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.


Well, I tried this but unfortunately it's not that simple. We still need to build the bitmaps, so I don't think add_expression_to_attributes can be just removed. I mean, we need to do the offsetting somewhere, even if we change how we do it.

But the main issue is that in some cases the number of expressions is not really limited by STATS_MAX_DIMENSIONS - for example when applying functional dependencies, we "merge" multiple statistics, so we may end up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts the order of processing (so far we've assumed expressions are after regular attnums). So the changes are more extensive - I tried doing that anyway, and I'm still struggling with crashes and regression failures. Of course, that doesn't mean we shouldn't do it, but it's far from mechanical. (Some of that is probably a sign this code needs a bit more work to polish.)

But I wonder if it'd be easier to just calculate the actual max attnum and then use it instead of MaxHeapAttributeNumber ...


Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
  s               | foo       |      |
(1 row)

but after populating and analysing the table, this becomes:

  statistics_name | tablename |  expr   | n_distinct
-----------------+-----------+---------+------------
  s               | foo       | (a + b) |         11
  s               | foo       | (a * b) |         11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
  s               | foo       |      |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+         LEFT JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+         JOIN LATERAL (
+             SELECT
+                 *
+             FROM (
+                 SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                     unnest(sd.stxdexpr)::pg_statistic AS a
+             ) x
+         ) stat ON true;

Will fix.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to