Re: [HACKERS] multivariate statistics (v19)

Tomas Vondra Thu, 02 Feb 2017 01:00:06 -0800


On 02/01/2017 11:52 PM, Alvaro Herrera wrote:

Still looking at 0002.
pg_ndistinct_in disallows input, claiming that pg_node_tree does thesame thing. But pg_node_tree does it for security reasons: you couldcrash the backend if you supplied a malicious value. I don't think
that applies to pg_ndistinct_in. Perhaps it will be useful to inject
fake stats at some point, so why not allow it? It shouldn't be
complicated (though it does require writing some additional code, so
perhaps that's one reason we don't want to allow input of these
values).

Yes, I haven't written the code, and I'm not sure it's a very practicalway to inject custom statistics. But if we decide to allow that in thefuture, we can probably add the code.

There's a subtle difference between pg_node_tree and the data types forstatistics - pg_node_tree stores the value as a string (matching thenodeToString output), so the _in function is fairly simple. Of course,stringToNode() assumes safe input, which is why the input is disabled.

OTOH the statistics are stored in an optimized binary format, allowingto use the value directly (without having to do expensive parsing etc).

I was thinking that the easiest way to add support for _in would be toadd a bunch of Nodes for the statistics, along with in/out functions,but keeping the internal binary representation. But that'll be tricky todo in a safe way - even if those nodes are coded in a very defensiveways, I'd bet there'll be ways to inject unsafe nodes.

So I'm OK with not having the _in for now. If needed, it's possible toconstruct the statistics as a bytea using a bit of C code. That's atleast obviously unsafe, as anything written in C, touching the memory.

The comment on top of pg_ndistinct_out is missing the "_out"; also it
talks about histograms, which is not what this is about.


OK, will fix.

In the same function, a trivial point you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context
-- just PG_RETURN_CSTRING(str.data) and forget about "ret".  Saves you
one line.


Will fix too.

Nearby, some auxiliary functions such as n_choose_k and
num_combinations are not documented. What it is that they do? I'd
move these at the end of the file, keeping the important entry points
at the top of the file.

I'd say n-choose-k is pretty widely known term from combinatorics. Thecomment would essentially say just 'this is n-choose-k' which seemsrather pointless. So as much as I dislike the self-documenting code,this actually seems like a good case of that.

I see this patch has a estimate_ndistinct() which claims to be a re-
implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does.  I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff.  (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

I'll look into that. I have to check if I ignored some assumptions orcorner cases the analyze.c deals with.

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h.  Also, it must not include
postgres.h, and it probably doesn't need most of the #includes it has;
those are better put into whatever include it.  It definitely needs a
guarding #ifdef MVSTATS_H around its whole content too.  An include file
is not just a way to avoid #includes in other files; it is supposed to
be a minimally invasive way of exporting the structs and functions
implemented in some file into other files.  So it must be kept minimal.


Will do.

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea?  I think
it's odd that outputting it as bytea yields something completely
different than as text.  (The bytea is not human readable and cannot be
used for future input, so what is the point?)

Because it internally is a bytea, and it seems useful to have theability to inspect the bytea value directly (e.g. to see the length ofthe bytea and not the string output).


In another subthread you seem to have surrendered to the opinion that
the new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put on it.  However, given
its schema, with a "starelid / stakeys", is it sensible to think that
we're going to get anything other than something that involves multiple
variables?  Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema.  Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
relid only?  Are you thinking that in the future you'd change starelid
into an oidvector column?

Yes, I think the starelid will turn into OID vector. The reason why Ihaven't done that in the current version of the catalog is to keep itsimple. Supporting join statistics will require tracking OID for eachattribute, because those will be from multiple relations. It'll alsorequire tracking "join condition" and so on.

We've designed the CREATED STATISTICS syntax to support this extension,but I'm strongly against complicating the catalogs at this point.

The comment in gram.y about the CREATE STATISTICS is at odds with what
is actually allowed by the grammar.


Which comment?

I think the name of a statistics is only useful to DROP/ALTER it, right?
I wonder why it's useful that statistics belongs in a schema.  Perhaps
it should be a global object?  I suppose the name collisions would
become bothersome if you have many mvstats.

I think it shouldn't be a global object. I consider them to be a part ofa schema (just like indexes, for example). Imagine you have amulti-tenant database, with using exactly the same (tables/indexes)schema, but keept in different schemas. Why shouldn't it be possible toalso use the same set of statistics for each tenant?



T.


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v19)

Reply via email to