On 02/06/2016 06:47 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> On 02/06/2016 02:34 AM, Tom Lane wrote:
>>> I have sussed what's happening in bug #13908. Basically, commit
>>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>>> memory") broke things for the case where a hash join is using a skew
>>> table.
>>
>> Damn, that's an embarrassing oversight :-/
>>
>> I believe the attached patch should fix this by actually copying
>> the tuples into the densely allocated chunks. Haven't tested it
>> though, will do in a few hours.
>
> Yeah, that's one fix approach I was contemplating last night. (I
> think the patch as written leaks memory and doesn't account for
> space usage properly either, but certainly this is a direction we
> could take.)

Yes, it definitely needs more work (at the very least, freeing the original copy of the tuple after moving it into the dense_alloc chunk, so we don't leak it).

> The other answer I was thinking about was to get rid of the
> assumption that iterating over the chunk storage is a valid thing to
> do, and instead scan the hashbucket chains when we need to visit all
> tuples. I really do not like the patch as designed, for several
> reasons:
>
> * It incorporates a bespoke reimplementation of palloc into hash
> joins. This is not a maintainable/scalable way to go about reducing
> memory consumption. It should have been done with an arm's-length API
> to a new type of memory context, IMO (probably one that supports
> palloc but not pfree, repalloc, or any chunk-header-dependent
> operations).

Hmmm, interesting idea. I was thinking about using a memory context when writing the dense allocation code, but I was stuck in the "must support all operations" mode, which made it seem impossible. Disallowing some of the operations would make it a viable approach, I guess.

> * No arm's-length API would conceivably allow remote callers to
> iterate over all allocated chunks in the way this code does, which is
> why we need to get rid of that behavior.

I'm not convinced we should throw away the idea of walking the chunks. I think it's kinda neat, and I've been playing with postponing construction of the buckets until the very end of the Hash build - it didn't work as well as expected, but I'm not ready to throw in the towel yet.

But perhaps the new memory context implementation could support some sort of iterator ...

> * There's no way to delete single tuples from the hash table given
> this coding, which no doubt is why you didn't migrate the skew tuples
> into this representation; but it doesn't seem like a very
> future-proof data structure.

I don't recall; it may be one of the reasons why the skew buckets use regular allocation. But I don't see how a new type of memory context could solve this, as it wouldn't support pfree either. Maybe using a separate context for each skew bucket?

> * Not doing anything for the skew tuples doesn't seem very good
> either, considering the whole point of that sub-module is that there
> are likely to be a lot of them.

Maybe I'm missing something, but I thought the values tracked in the skew buckets are the MCVs from the outer table, in the hope of reducing the amount of data that needs to be spilled to disk when batching the outer relation.

I don't see why there should be a lot of them in the inner relation (well, I can imagine such cases, but in my experience they are rare).

> I note also that while the idea of ExecHashRemoveNextSkewBucket is
> to reduce memory consumed by the skew table to make it available to
> the main hash table, in point of fact it's unlikely that the freed
> space will be of any use at all, since it will be in tuple-sized
> chunks far too small for dense_alloc's requests. So the spaceUsed
> bookkeeping being done there is an academic exercise unconnected to
> reality, and we need to rethink what the space management plan is.

I don't follow. Why would these three things (the sizes of the allocations in the skew buckets, the chunks in the dense allocator, and the accounting) be related?

FWIW the dense allocator actually made the memory accounting much more accurate, as it eliminates most of the overhead that was not included in spaceUsed before.

> So I'm of the opinion that a great deal more work is needed here. But
> it's not something we're going to be able to get done for 9.5.1, or
> realistically 9.5.anything. Whereas adding additional klugery to
> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

Agreed.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

