Hi,

My suggestion is to keep two sets of histograms: one generated by running ANALYZE, and another generated dynamically from the entries written to the log during insert/update/delete operations. I am not sure how difficult it would be to read such record details from the logs.
Basically, from the details mentioned here, what I understand is that the table data (timestamps) grows incrementally: the existing data is not modified to any great extent, and new data is merely appended to the old. In this case, the only work needed for ANALYZE/statistics generation is to merge the histograms of the newly added records into the old histograms.

If we can treat this case as similar to merging histograms for a join of two tables (where we generate the histogram for the cartesian/result node), then all we need to do is somehow generate a temporary histogram for the new set of records and merge it with the old histogram. The values of the interesting columns in the new records can be read from the logging section. We must already have the histogram-merging code, and I hope it would not be so costly to make these calls that it affects the planner. (My further opinion is that we should calculate the cost of this histogram generation and use it in costing in some way.)

Further, we can put a threshold on when this merge happens automatically: only when the temporary histogram reaches some set threshold would it be merged into the older one.

Please pass on your inputs.

Regards,
Chetan

On Wed, Dec 30, 2009 at 8:38 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Josh Berkus <j...@agliodbs.com> writes:
> > My thoughts on dealing with this intelligently without a major change to
> > statstics gathering went along these lines:
>
> > 1. add columns to pg_statistic to hold estimates of upper and lower
> > bounds growth between analyzes.
>
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.
> (Note that convert_to_scalar does not assume that it can measure
> arbitrary distances, but only fractions *within* a histogram bucket;
> and even that is pretty shaky.)
>
> I don't have a better idea at the moment :-(
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
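For what it's worth, the merge step I have in mind could be sketched roughly as below. This is only a minimal illustration in Python, not PostgreSQL code; make_histogram and merge_histograms are hypothetical names, and re-sampling each bucket at its midpoint (weighted by depth) is just one crude way to approximate the merged distribution:

```python
def make_histogram(values, nbuckets):
    # Equi-depth histogram: nbuckets+1 boundary values plus a row count,
    # roughly like the histogram_bounds that ANALYZE stores.
    s = sorted(values)
    n = len(s)
    bounds = [s[min(round(i * (n - 1) / nbuckets), n - 1)]
              for i in range(nbuckets + 1)]
    return bounds, n

def merge_histograms(old, new, nbuckets):
    # Approximate the merged distribution: represent each bucket of each
    # input histogram by its midpoint, repeated in proportion to the
    # histogram's row count, then rebuild the boundaries from that pool.
    samples = []
    for bounds, count in (old, new):
        per_bucket = count / (len(bounds) - 1)
        for lo, hi in zip(bounds, bounds[1:]):
            samples.extend([(lo + hi) / 2.0] * max(int(per_bucket), 1))
    merged_bounds, _ = make_histogram(samples, nbuckets)
    return merged_bounds, old[1] + new[1]
```

With a threshold policy as suggested above, the "new" histogram would be built from logged rows and merged into the old one only once its row count crosses some configured limit.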