On 02/06/2016 06:47 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> On 02/06/2016 02:34 AM, Tom Lane wrote:
>>> I have sussed what's happening in bug #13908. Basically, commit
>>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>>> memory") broke things for the case where a hash join is using a skew
>>> table.
>>
>> Damn, that's an embarrassing oversight :-/
>>
>> I believe the attached patch should fix this by actually copying
>> the tuples into the densely allocated chunks. Haven't tested it
>> though, will do in a few hours.
>
> Yeah, that's one fix approach I was contemplating last night. (I
> think the patch as written leaks memory and doesn't account for
> space usage properly either, but certainly this is a direction we
> could take.)

Yes, it definitely needs more work (at the very least, freeing the original copy of the tuple after moving it into the dense_alloc chunk, so we don't leak it).

> The other answer I was thinking about was to get rid of the
> assumption that iterating over the chunk storage is a valid thing to
> do, and instead scan the hashbucket chains when we need to visit all
> tuples. I really do not like the patch as designed, for several
> reasons:
>
> * It incorporates a bespoke reimplementation of palloc into hash
> joins. This is not a maintainable/scalable way to go about reducing
> memory consumption. It should have been done with an arm's-length API
> to a new type of memory context, IMO (probably one that supports
> palloc but not pfree, repalloc, or any chunk-header-dependent
> operations).

Hmmm, interesting idea. I was thinking about using a memory context when writing the dense allocation code, but I was stuck in the "must support all operations" mode, which made it seem impossible. Disallowing some of the operations would make it a viable approach, I guess.

> * No arm's-length API would conceivably allow remote callers to
> iterate over all allocated chunks in the way this code does, which is
> why we need to get rid of that behavior.

I'm not convinced we should throw away the idea of walking the chunks. I think it's kinda neat, and I've been playing with postponing construction of the buckets until the very end of the Hash build - it didn't work as well as expected, but I'm not ready to throw in the towel yet.

But perhaps the new memory context implementation could support some sort of iterator ...

> * There's no way to delete single tuples from the hash table given
> this coding, which no doubt is why you didn't migrate the skew tuples
> into this representation; but it doesn't seem like a very
> future-proof data structure.

I don't recall; it may be one of the reasons why the skew buckets use regular allocation. But I don't see how a new type of memory context could solve this, as it wouldn't support pfree either. Maybe using a separate context for each skew bucket?

> * Not doing anything for the skew tuples doesn't seem very good
> either, considering the whole point of that sub-module is that there
> are likely to be a lot of them.

Maybe I'm missing something, but I thought the values tracked in the skew buckets are the MCVs from the outer table, in the hope of reducing the amount of data that needs to be spilled to disk when batching the outer relation.

I don't see why there should be a lot of them in the inner relation (well, I can imagine such cases, but in my experience they are rare).

> I note also that while the idea of ExecHashRemoveNextSkewBucket is
> to reduce memory consumed by the skew table to make it available to
> the main hash table, in point of fact it's unlikely that the freed
> space will be of any use at all, since it will be in tuple-sized
> chunks far too small for dense_alloc's requests. So the spaceUsed
> bookkeeping being done there is an academic exercise unconnected to
> reality, and we need to rethink what the space management plan is.

I don't follow. Why would these three things (the sizes of the allocations in the skew buckets, the chunks in the dense allocator, and the accounting) be related?

FWIW the dense allocator actually made the memory accounting much more accurate, as it eliminates most of the overhead that was not included in spaceUsed before.

> So I'm of the opinion that a great deal more work is needed here. But
> it's not something we're going to be able to get done for 9.5.1, or
> realistically 9.5.anything. Whereas adding additional klugery to
> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

Agreed.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

