On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus <j...@agliodbs.com> wrote:

> > > On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> Yeah. I would like to see the planner's cost estimates extended to > >> include some sort of uncertainty estimate, whereupon risk-averse people > >> could ask it to prefer low-uncertainty plans over high-uncertainty ones > >> (the plans we typically choose for ORDER BY ... LIMIT queries being > great > >> examples of the latter). But it's a long way from wishing that to > making > >> it so. Right now it's not even clear (to me anyway) how we'd measure or > >> model such uncertainty. > > I have been thinking of some ways to have a risk estimate of each selectivity that our planner gives. I think a way to do it is as follows: One of the factors that leads to bad estimates is that the histogram of the values of a column maintained by the planner gets old by time and the data in the column changes. So, the histogram is no longer a quite accurate view of the data and it leads to bad selectivity. One thing we can try to do is to add a factor of error that we feel the selectivity given can have. This allows us to factor in the probability that the data changed and the estimate of the difference of the current histogram and the histogram of the actual data currently present in the column in the table. We can use Central Limit Theorem ( http://en.wikipedia.org/wiki/Central_limit_theorem). Essentially, what the theorem says is that given a distribution that has finite variance and finite mean, we can take random independent samples from the data and calculate the standard deviation and the mean of the sample. If we have large enough number of samples and if we plot the mean and SD, they would follow a normal distribution. What is interesting is that this can allow us to predict the SD of a given dataset from the curve and the SD should be directly proportional to the deviation it has from the given planner histogram. I am no mathematician hence its hard for me to explain. I think this link [1] will be more helpful. So, we can have a probability value for the random variable and that shall model the confidence we have in our estimate. I may be wrong in some parts but I hope I have been able to convey the general idea. If this idea develops, I shall be happy to work on this but my hands are full in ROLLUPS right now, so for my part it shall take time. I just want to float the idea and get a general feel about the idea right now. Please let me know your comments and feedback on this. Regards, Atri [1]: http://www.theriac.org/DeskReference/viewDocument.php?id=177 -- Regards, Atri *l'apprenant*