On Wed, Jun 30, 2021 at 11:15 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> You're right maintaining a per-partition samples and merging those might
> solve (or at least reduce) some of the problems, e.g. eliminating most
> of the I/O that'd be needed for sampling. And yeah, it's not entirely
> clear how to merge some of the statistics types (like ndistinct). But
> for a lot of the basic stats it works quite nicely, I think.

It feels like you might in some cases get very different answers.
Let's say you have 1000 partitions. In each of those partitions, there
is a particular value that appears in column X in 50% of the rows.
This value differs for every partition. So you can imagine for example
that in partition 1, X = 1 with probability 50%; in partition 2, X = 2
with probability 50%, etc. There is also a value, let's say 0, which
appears in 0.5% of the rows in every partition. It seems possible that
0 is not an MCV in any partition, or in only some of them, but it
might be more common overall than the #1 MCV of any single partition.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
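
As a rough check on the arithmetic in the example above, here is a minimal Python sketch. It assumes all partitions are the same size, and it uses a hypothetical rule that a value makes a partition's MCV list only if its local frequency is at least 1%. Neither assumption comes from the thread, and the cutoff is not how ANALYZE actually picks MCVs; the names (MCV_CUTOFF, local_mcv_list, combine) are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

NUM_PARTITIONS = 1000
# Hypothetical cutoff: a value is kept in a partition's MCV list only if
# its local frequency is at least 1%. This is NOT PostgreSQL's actual rule.
MCV_CUTOFF = Fraction(1, 100)


def local_frequencies(p):
    """Frequencies of the two interesting values of X in partition p."""
    # Value p fills 50% of the partition; value 0 fills 0.5% of it.
    return {p: Fraction(1, 2), 0: Fraction(1, 200)}


def local_mcv_list(p):
    """Per-partition MCV list under the hypothetical cutoff."""
    return {v: f for v, f in local_frequencies(p).items() if f >= MCV_CUTOFF}


def combine(per_partition):
    """Average per-partition frequencies, assuming equal-sized partitions."""
    overall = Counter()
    for freqs in per_partition:
        for value, freq in freqs.items():
            overall[value] += freq / NUM_PARTITIONS
    return overall


partitions = range(1, NUM_PARTITIONS + 1)
# Ground truth: combine the full per-partition frequencies.
true_freq = combine(local_frequencies(p) for p in partitions)
# What you get if only the per-partition MCV lists are merged.
merged_mcv = combine(local_mcv_list(p) for p in partitions)

print("true overall frequency of 0:", float(true_freq[0]))
print("true overall frequency of 1:", float(true_freq[1]))
print("0 present in merged MCV lists:", 0 in merged_mcv)
```

Under those assumptions, 0 has an overall frequency of 0.5%, ten times the 0.05% of any single partition's top value, yet a merge that only looks at the per-partition MCV lists never sees it.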