On 09/09/2015 03:55 PM, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra wrote:
>> Also, I'm not sure what other places you have in mind (could you list
>> some examples?), but I'd bet we limit the allocation to 1GB because of
>> the palloc() limit and not because of fear of over-estimates.
> I don't really think those two things are different from each other.
> The palloc() limit is a means of enforcing a general policy of
> limiting all allocations to 1GB except in places where we've made a
> very conscious decision to allow a specific exception. This limit
> happens to dovetail nicely with the varlena size limit, so in many
> cases it is the exactly correct limit just for that reason. But even
> when, as here, that's not at issue, it's still a useful limit, because
> there are many ways that some garbage value can get passed to palloc
> -- bad planner estimates, corrupted tuples, bugs in other parts of our
> code. And at least on my old MacBook Pro (I haven't tested the
> current one), passing a sufficiently-large value to malloc() causes a
> kernel panic. That's probably a particularly bad bug, but there are
> lots of systems where "accidentally" allocating an unreasonable amount
> of space will have all kinds of unpleasant consequences. So, I
> believe that palloc()'s limit improves the overall stability of the
> system considerably even if it causes some occasional annoyance.
I'm not really buying this. The 1GB limit has nothing to do with platform
limits - it's there exactly to make allocations varlena-like (varlena has
exactly the same limit), and because it allows using a 32-bit int to track
all the bytes. Neither of these is relevant here.

It has nothing to do with malloc() limits on various platforms, and if
there really are such limits that we think we should worry about, we
should address those properly - not by band-aiding all the various
places independently.
And most importantly, these platform limits would apply to both the
initial allocation and to the subsequent resize. It's utterly useless to
just "fix" the initial allocation and then allow failure when we try to
resize the hash table.
> Most of the time, you can just palloc() and not worry too much about
> whether you're going to blow up the machine: you won't, because you
> aren't going to allocate more than 1GB. Any place that wants to
> allocate more than that needs to be someplace where we can be pretty
> sure that we're not going to accidentally allocate some completely
> unreasonable amount of memory, like say 1TB. Nothing in this
> discussion convinces me that this is such a place. Note that
We're not going to allocate a completely unreasonable amount of memory,
because there already are some checks in place.

Firstly, you can't really get buckets larger than ~25% of work_mem,
because the bucket pointer is only 8 bytes, while the hash-join tuple is
16 bytes plus the actual data (IIRC). For wider tuples, the fraction
spent on buckets decreases further.

Secondly, we limit the number of buckets to INT_MAX, so the bucket array
is at most about 16GB (because buckets are just pointers). No matter how
awful an estimate you get (or how insanely high you set work_mem), you
can't exceed this.
> tuplesort.c and tuplestore.c, the only existing callers of
> repalloc_huge, only allocate such large amounts of memory when they
> actually have enough tuples to justify it - it is always based on the
> actual number of tuples, never an estimate. I think that would be a
> sound principle here, too. Resizing the hash table to such a large
> size based on the actual load factor is very reasonable; starting with
> such a large size seems less so. Admittedly, 512MB is an arbitrary
> point, and if it so happened that the limit was 256MB or 1GB or 128MB
> or even 2GB I wouldn't advocate for changing it just for fun. But
> you're saying we should just remove that limit altogether, and I think
> that's clearly unreasonable. Do you really want to start out with a
> TB or even PB-sized hash table when the actual number of tuples is,
> say, one? That may sound crazy, but I've seen enough bad query plans
> to know that, yes, we are sometimes off by nine orders of magnitude.
> This is not a hypothetical problem.
No, I'm not saying anything like that - I actually explicitly stated that
I'm not against such a change (further restricting the initial hash table
size), if someone takes the time to do a bit of testing and provides some
numbers.

Moreover, as I explained, there already are limits in place (25% of
work_mem or 16GB, whichever is lower), so I don't really see the bugfix
as particularly dangerous.

Maybe if we decide to lift this restriction (using int64 to address the
buckets, which removes the 16GB limit), this issue will get much more
pressing. But I guess hash tables handling 2 billion buckets will be
enough for the near future.
>>> More importantly, removing the cap on the allocation size makes the
>>> problem a lot worse. You might be sad if a bad planner estimate
>>> causes the planner to allocate 1GB when 64MB would have been enough,
>>> but on modern systems it is not likely to be an enormous problem. If
>>> a similar mis-estimation causes the planner to allocate 16GB rather
>>> than 1GB, the opportunity for you to be sad is magnified pretty
>>> considerably. Therefore, I don't really see the over-estimation bug
>>> fix as being separate from this one.
>> Perhaps. But if you want to absolutely prevent such sadness then maybe
>> you should not set work_mem that high?
> I think that's a red herring for a number of reasons. One, the
> allocation for the hash buckets is only a small portion of the total
> memory. Two, the fact that you are OK with the hash table growing to
> a certain size does not mean that you want it to start out that big on
> the strength of a frequently-flawed planner estimate.
True, although I don't think the herring is entirely red. It might just
as well be a mackerel ;-)
>> Anyway, I do see this as a rather orthogonal problem - an independent
>> improvement, mostly unrelated to the bugfix. Even if we decide to
>> redesign it like this (and I'm not particularly opposed to that,
>> assuming someone takes the time to measure how expensive the
>> additional resize actually is), we'll still have to fix the repalloc().
>>
>> So I still fail to see why we shouldn't apply this fix.
> In all seriousness, that is fine. I respect your opinion; I'm just
> telling you mine, which happens to be different.
Let me repeat my previous proposal:

1) Let's apply the proposed bugfix (and also backpatch it), because the
   current code *is* broken.

2) Do a bunch of experiments with limiting the initial hash size, and
   decide whether the impact on well-estimated cases is acceptable.
I'm strongly opposed to just limiting the initial size without actually
measuring how expensive the resize is, as that simply adds cost to the
well-estimated cases (which may easily be the vast majority of all the
plans, as we tend to notice just the poorly estimated ones).
--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services