However, by realizing that the bounds on the ranges have a linear ordering,
one can speed this up to O(m) using window functions on common table
expressions.
So what I am proposing is formalizing this optimization into a class of data
types that will hide the implementation details.
What about MCVs? Will those be removed as well?
Sure. Those seem even less useful.
Ya, this will destroy the performance of several queries without some
heavy tweaking.
Maybe this is bad design, but I've gotten in the habit of storing
sequences as arrays and I commonly join on them. I
[ sorry Tom, reply all this time... ]
What do you mean by storing sequences as arrays?
So, a simple example is, for transcripts ( sequences of DNA that are
turned into proteins ), we store each of the connected components as
an array of the form:
exon_type in [1,6]
splice_type = [1,3]
and
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Alexander Korotkov aekorot...@gmail.com writes:
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:
I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Nathan Boley npbo...@gmail.com writes:
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
I am starting to look at this patch now. I'm wondering exactly why the
decision was made to continue storing btree
I actually don't know much about the I/O subsystem, but, no, WAL is
not separated from data. I believe $PGDATA is on a SAN, but I don't
know anything about its characteristics.
The computer is here:
http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm
$PGDATA is on a 5 disk SATA
Version of patch with a few more comments and some fixes.
Where are we on the idea of better statistics for arrays?
I need to finish reviewing the patch: I'll try to send in something
tomorrow night. So far it looks good.
Best,
Nathan
--
Sent via pgsql-hackers mailing list
Rebased with head.
FYI, I've added myself as the reviewer for the current commitfest.
Best,
Nathan Boley
Looking now, I see that Alexander wasn't Cc'd on the review, so it's
possible he missed the message?
We've corresponded off list and have discussed my review at some length.
Alex submitted an updated patch on Sep 22 to me personally ( although
not to the list? Alex? ), with the promise of a
Actually, it is not the MCV slot that is used but MCELEM, which is a
different slot. The pg_stats view maps both into the same fields.
Yes, you're right. I am sorry about the error.
Surely, non-standard use of histogram slot
should be avoided.
Agreed.
Best,
Nathan
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539
I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have some comments/questions.
Design:
First, it makes me uncomfortable that you are using the MCV and
Personally, I think the first thing we ought to do is add a real, bona
fide planner hint to override the selectivity calculation manually,
maybe something like this:
WHERE (x 5 AND y = 1) SELECTIVITY (0.1);
If you're going to go that far, why not just collect statistics on
that specific
FWIW, a very informal survey of probabilists didn't yield any reason
for trying to put an order on the empty set ( unless the metric was
cardinality or other equivalence relation ).
I think the problem here is that the idea of union and intersection
forming a ring over sets is being conflated
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
On 19.1.2011 20:25, Robert Haas wrote:
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
Yes, I was aware of this problem (amount of memory consumed with lots of
updates), and I kind of hoped someone will
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done with it?
- the best you could do is to average the
individual probabilities which gives ... well, 1/ndistinct.
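As a quick sanity check of that claim (illustrative code, not from the thread): since the per-value relative frequencies always sum to 1, their average over the distinct values is exactly 1/ndistinct, no matter how skewed the distribution is.

```python
from collections import Counter

def avg_value_frequency(column):
    """Average, over distinct values, of each value's relative frequency."""
    n = len(column)
    freqs = [c / n for c in Counter(column).values()]
    # freqs sums to 1, so this is always 1/len(freqs) == 1/ndistinct,
    # regardless of how unevenly the rows are distributed
    return sum(freqs) / len(freqs)

skewed = [1] * 60 + [2, 3, 4, 5] * 10  # 100 rows, ndistinct = 5, heavy skew
```

Here `avg_value_frequency(skewed)` is 0.2 = 1/5, exactly as the thread says, even though value 1 holds 60% of the rows.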
to get it committed ( check
the archives for cross column stat threads - there are a lot ).
Best,
Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote:
On Jan20, 2011, at 02:41 , Nathan Boley wrote:
If you think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done
print int(1e+01)
10
That isn't building an integer: it is creating a float and casting to an int.
try:
int( 1e100 )
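To spell out the distinction (a small illustrative snippet): `1e100` is a float literal, so `int(1e100)` converts a value that only ever had float precision, whereas `10**100` is exact integer arithmetic.

```python
# int() on a float round-trips exactly only while the float can
# represent the integer exactly (roughly up to 2**53, plus larger
# integers whose mantissa still fits, like 10**20).
assert int(1e1) == 10          # small values survive the round trip
assert float(10**20) == 1e20   # 10**20 = 5**20 * 2**20 is still exact
assert int(1e100) != 10**100   # precision was already lost in the float
```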
Best,
Nathan
The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?
Well, I'm not real familiar
Robert explained why having more MCVs might be useful because we use
the frequency of the least common MCV as an upper bound on the
frequency of any value not in the MCV list.
Where is that being used? The only non-MCV frequency estimate that I
recall seeing is ( nrows - n_ndistinct_rows )/ndistinct.
That one's used, too, but the other is used as an upper bound.
n_distinct tends to come out too small on large tables, so that
formula is prone to overestimation. Actually, both formulas are prone
to overestimation.
Right - thanks.
When this happens depends on the values of a whole
The advantage of specifying a + and a - in the type interface is that
the unit definition can then be specified as part of the type
declaration itself. So you can do:
CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s');
CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m');
All of
On the basis of all of the foregoing, I don't think we can consider
this patch further for this CommitFest and will update
commitfest.postgresql.org accordingly.
FWIW, I am very excited about this patch and would be happy to review
it but have been very busy over the past month. If I can
I think it would be great for you to review it... I doubt that will
cause it to get committed for 9.0, but my doubt is no reason for you
to hold off reviewing it.
I assumed so, but the pretense of a chance will probably help to motivate me :-)
I'll have something by Thursday, and then
Because intervals (mathematical not SQL) can be open or closed at each
end point, we need to know what the next and previous value would be at
the specified granularity. And while you can do some operations without
knowing this, there are many you can't. For instance you could not tell
whether
Right, I don't think strings are any more or less countable than integers.
(and yes, it's a bit OT).
Well, if I remember my algebra, I think the issue is that integers are
locally finite whereas strings are not ( assuming standard orderings,
of course ).
-Nathan
Again, I'm only one user. But so far I haven't seen anyone else speak
up here, and clearly accepting this for inclusion will need nontrivial
convincing.
Well, FWIW, I am excited about better type integration.
Also, I am a little skeptical about this patch. I am sorry if this has
already been
Here's the patch to support Python >= 3.1 with PL/Python. The
compatibility code is mostly in line with the usual 2-3 C porting
practice and is documented inline.
I took a cursory look at this patch and, while the logic seems sound
and roughly in line with the suggested python porting
After reading the docs in the patch I don't believe you're going to all
this trouble to ensure two circles don't overlap. Can you give some
better examples of what you're trying to achieve and why anyone else
would care? (I'm busy, so are others).
Non overlapping time intervals is one use
I'm finding myself unable to follow all the terminology on this thread.
What's dimension reduction? What's PCA?
( Sorry for the jargon - thanks Josh )
It feels like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA.
Certainly - this
For things like PostGIS, which will want to index in 4 dimensions
(x, y, z, t), we might want to have multi-dimensional selectivity
histograms and some way to use same.
Another use case is cross column statistics.
Anybody here qualified to check out this paper on the subject, please
speak
Finally, this creates the partition but ( AFAICT ) it doesn't describe
a method for locating the histogram estimate given a point ( although
that doesn't seem too difficult ).
Is that not difficult, in terms of the math that needs doing, or
not difficult, in terms of how well PostgreSQL is
On Mon, Jun 29, 2009 at 3:43 PM, Tom Lanet...@sss.pgh.pa.us wrote:
David Fetter da...@fetter.org writes:
On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
... They dismiss
singular value decomposition and the discrete wavelet transform as
being too parametric ( which is silly
Surely the most important point in the OP was that ineqsel does not
correctly binary search in the presence of duplicates.
It would have been if I were correct :-( .
Looking at it again, that was from a bug in my code. Thanks for your
time, and sorry about the noise.
-Nathan
For heavy tailed distributions, it is possible for analyze to
duplicate histogram boundaries.
I don't think this is a bug.
hmmm... Well, I assumed it was a bug from a comment in analyze.
From ( near ) line 2130 in analyze.c
* least 2 instances in the sample. Also, we won't suppress values
One more direction could be implementing MCV for range of values (group
values and interpolate in between). Consider statistics on timestamp column
that says that for 2008-December there are as many X rows, for
2008-November as many as Y, etc. That could be used for rather accurate
What is the specific difference between what you are talking about and
what scalarineqsel already implements?
Hmm... Nothing new. Sorry for bothering you. I did not realize
histograms were already implemented.
Well, ISTM there is a profound difference. For scalarineqsel we care
about the total
Thanks for the response.
Well, ISTM there is a profound difference. For scalarineqsel we care
about the total number of values in a bucket. For eqsel we care about
the total number of *distinct* values in each bucket
Really?
Well, we would obviously also care about the total number of
Isn't a selectivity estimate of x = v as ( the number of values in v's
histogram bucket ) / ( number of distinct values in v's histogram
bucket ) pretty rational? That's currently what we do for non-MCV
values, except that we look at ndistinct over the whole table instead
of individual
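A sketch of the two estimates being contrasted (function and variable names are mine; this is not the selfuncs.c code):

```python
def eqsel_current(table_ndistinct):
    """Roughly what eqsel does for a non-MCV value today: spread the
    rows evenly across the whole table's distinct values."""
    return 1.0 / table_ndistinct

def eqsel_per_bucket(bucket_rows, bucket_ndistinct, table_rows):
    """The proposed estimate: rows in v's histogram bucket, divided
    evenly among the distinct values seen in that bucket."""
    return (bucket_rows / bucket_ndistinct) / table_rows

# A skewed table: 1000 rows, 500 distinct values overall, but v falls
# in a hot bucket holding 100 rows spread over only 2 distinct values.
# The per-bucket estimate (0.05) then differs from the whole-table
# estimate (0.002) by a factor of 25.
```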
Thanks for putting out pgtune - it's a sorely needed tool.
I had a chance to look over the source code and have a few comments,
mostly about python specific coding conventions.
- The windows specific try block ( line 16 ) raises a ValueError vs
ImportError on my debian machine. Maybe it would be
I still need to go through backend/utils/adt/selfuncs.c
to figure out exactly how we use the one-dimensional values.
Here's a page that helped me figure all this out.
http://www.postgresql.org/docs/8.1/static/planner-stats-details.html
2) Do we want to fold the MCVs into the dependence
Right now our
histogram values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from it into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
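The quantile construction described above can be sketched as follows (simplified and illustrative: ANALYZE works from a sample rather than sorting the whole column, and the names here are mine):

```python
def equi_depth_bounds(column, T):
    """Return T+1 boundaries so that about 1/T of the sorted values
    fall between consecutive boundaries L[n] and L[n+1]."""
    s = sorted(column)
    n = len(s)
    # pick T+1 evenly spaced positions through the sorted values
    return [s[round(i * (n - 1) / T)] for i in range(T + 1)]
```

For a uniform column of 0..100 with statistics target T = 4, this yields the boundaries [0, 25, 50, 75, 100], i.e. each bucket holds about a quarter of the values.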
I'm still working my way around the math, but copulas sound better
than anything else I've been playing with.
I think the easiest way to think of them is, in 2-D finite spaces,
they are just a plot of the order statistics against one another. Feel
free to mail me off list if you have any math
Yes, but always in relation to operator classes, so from BTrees opclass or
such, which I referred to as the context of index searches, as I don't
really see any theoretical need for opclass if it's not for indexing.
IIRC analyze uses the btree op class to build the selectivity histogram.
-Nathan
Assuming that the threshold
for switching to an indexscan is somewhere around selectivity 0.005
(I am not certain offhand, but it's in that general area), this cannot
possibly require more than 200 MCV slots, and for most data
distributions it'd be a whole lot less.
That's a really good
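The arithmetic behind the 200-slot bound, spelled out (illustrative):

```python
# If a value's frequency must exceed the ~0.005 selectivity crossover
# for its MCV entry to matter, then at most 1/0.005 = 200 values can
# each clear that bar, since their frequencies sum to at most 1.
threshold = 0.005
max_useful_mcvs = round(1 / threshold)
assert max_useful_mcvs == 200
```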
One more problem with low ndistinct values is that the condition might
very well
hit no rows at all. But Idea 1 will largely overestimate the number of
hits.
That's a good point, but I don't see a clear solution. Maybe we could
look at past queries
and keep track of how often they return
I think that MCVs are the solution, right?
Only if
If we query on values that aren't in the table, the planner will
always overestimate the expected number of returned rows because it (
implicitly ) assumes that every query will return at least 1 record.
That's intentional and should not be changed.
Why? What if ( somehow ) we knew that there was a 90% chance that the
query would return an empty result set on a big table with 20 non-MCV
distinct values. Currently the planner would always choose a seq scan,
where an index scan might be better.
(1) On what grounds do you assert the above?
Your argument seems to consider only columns having a normal
distribution.
My example was based upon normally distributed data because people
usually know what they are and they are reasonably common.
How badly does it fall apart for non-normal distributions?
This should work to the extent
Currently eqsel assumes that, except for the values stored as mcv's,
the number of times that a value appears in a table column is
independent of its value. Unfortunately, this often yields
inappropriate selectivity estimates and frequently leads to
inappropriate plans.
As an example, consider