# Re: [HACKERS] WIP: multivariate statistics / proof of concept

On 12/12/14 05:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> +     * FIXME This sample sizing is mostly OK when computing stats for
>> +     *       individual columns, but when computing multi-variate stats
>> +     *       for multivariate stats (histograms, mcv, ...) it's rather
>> +     *       insufficient. For small number of dimensions it works, but
>> +     *       for complex stats it'd be nice use sample proportional to
>> +     *       the table (say, 0.5% - 1%) instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the sample is random, the accuracy of a sample doesn't depend much on the size of the population. For example, if you sample 1,000 rows from a table with 100,000 rows, or 1,000 rows from a table with 100,000,000 rows, the accuracy is pretty much the same. That doesn't change when you go from a single variable to multiple variables.
>
> You do need a bigger sample with multiple variables, however. My gut feeling is that if you sample N rows for a single variable, with two variables you need to sample N^2 rows to get the same accuracy. But it's not proportional to the table size. (I have no proof for that, but I'm sure there is literature on this.)

[...]

I did stage III statistics at University many moons ago...

The accuracy of the sample depends only on the value of N, not on the total size of the population, subject to the obvious constraint that N <= population size.
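
This is easy to check empirically. The sketch below is my own illustration (not from the patch or the thread): it repeatedly draws 1,000-row samples without replacement from two populations of very different sizes, each with 30% "matching" rows, and compares the spread of the estimated fraction. The 30% frequency, sample size, and trial count are arbitrary choices for the demonstration.

```python
import random

random.seed(42)
p, n_sample, trials = 0.30, 1_000, 500

def estimate_spread(pop: int) -> float:
    """Std-dev of the sampled fraction over repeated n_sample-row samples."""
    good = int(pop * p)          # rows 0..good-1 "match" the condition
    ests = []
    for _ in range(trials):
        # sampling without replacement, as ANALYZE-style row sampling does
        sample = random.sample(range(pop), n_sample)
        ests.append(sum(i < good for i in sample) / n_sample)
    mean = sum(ests) / trials
    return (sum((e - mean) ** 2 for e in ests) / trials) ** 0.5

for pop in (100_000, 100_000_000):
    print(f"population {pop:>11,}: spread of estimate = {estimate_spread(pop):.4f}")
```

Despite the 1,000-fold difference in population size, both spreads come out essentially the same, close to the theoretical sqrt(p * (1 - p) / N) ≈ 0.0145.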
The standard error of a random sample estimate is inversely proportional to the square root of N. So N = 100 gives a standard error of about 10%, and to reduce it to 5% you would need N = 400.
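
That 1/sqrt(N) scaling can be demonstrated with a short simulation (my own sketch, with arbitrary parameters): estimating the frequency of a value that occurs 50% of the time, the spread of the estimate should halve when the sample grows from N = 100 to N = 400.

```python
import random

random.seed(1)
trials = 4_000

def stderr_of_estimate(n: int, p: float = 0.5) -> float:
    """Empirical std-dev of the estimated frequency over many samples of size n."""
    ests = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]
    mean = sum(ests) / trials
    return (sum((e - mean) ** 2 for e in ests) / trials) ** 0.5

se100 = stderr_of_estimate(100)   # theory: sqrt(0.25 / 100) = 0.050
se400 = stderr_of_estimate(400)   # theory: sqrt(0.25 / 400) = 0.025
print(f"N=100: {se100:.3f}   N=400: {se400:.3f}   ratio: {se100 / se400:.2f}")
```

The ratio comes out close to 2: quadrupling the sample halves the standard error.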
For multiple variables, the required sample size will also be a function of N - I don't recall precisely how, but I suspect it might be M * N, where M is the number of parameters (though I'm not as certain). I think M^N might be needed if you want all the possible correlations between sets of variables to be reasonably significant - but I'm mostly just guessing here.
So using a % of table size is somewhat silly, given the above. However, if you want to detect frequencies that occur at the 1% level, then you will need to sample 1% of the table or more. So which approach is 'best' depends on what you are trying to determine. The sample size is more useful when you need to decide between 2 different hypotheses.
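
As a rough back-of-the-envelope illustration of why rare frequencies push the required sample up (my own helper, not anything in the patch, assuming a simple binomial sampling model): to estimate a frequency p with a standard error of 10% of p itself, you need roughly N = (1 - p) / (p * 0.1^2) rows.

```python
import math

def rows_needed(freq: float, rel_err: float = 0.10) -> int:
    """Sample size so the standard error of the estimated frequency is
    rel_err * freq, under simple binomial sampling:
    SE = sqrt(freq * (1 - freq) / N)  =>  N = (1 - freq) / (freq * rel_err**2)
    """
    return math.ceil((1 - freq) / (freq * rel_err ** 2))

print(rows_needed(0.50))   # a common (50%) value: 100 rows
print(rows_needed(0.01))   # a rare (1%) value: 9,900 rows
```

Under this model, a 1% value needs roughly a hundred times more rows than a 50% value to reach the same relative accuracy.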
The sampling methodology is far more important than the ratio of N to population size - consider the bias introduced by using random telephone numbers, even before the advent of mobile phones!

Cheers,
Gavin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers