On 10/09/2016 03:27 AM, Peter Geoghegan wrote:
You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).
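(For readers following along, the relationship between the memory budget and the number of tapes can be sketched roughly like this. This is a simplified, illustrative version of the logic in tuplesort_merge_order(); the buffer-size constants and the MAXORDER cap are stand-ins, not the exact values in tuplesort.c:)

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants -- not the exact values used by tuplesort.c */
#define TAPE_BUFFER_OVERHEAD  8192          /* one 8KB block per tape */
#define MERGE_BUFFER_SIZE     (8192 * 32)   /* per-input-tape merge buffer */
#define MINORDER              6
#define MAXORDER              500           /* the proposed cap */

static int
merge_order(int64_t allowed_mem)
{
    /* Each input tape needs a merge buffer plus its own I/O buffer. */
    int m = (int) ((allowed_mem - TAPE_BUFFER_OVERHEAD) /
                   (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD));

    if (m < MINORDER)
        m = MINORDER;
    if (m > MAXORDER)
        m = MAXORDER;   /* the cap under discussion: only binds at high work_mem */
    return m;
}
```

With arithmetic of this shape, the cap only starts to matter once allowed_mem is large enough to afford MAXORDER tapes' worth of buffers; at low work_mem settings the division itself is what limits the tape count.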


Regardless of the number of tapes, the memory used for the tape buffers while building the initial runs is wasted. It's not entirely wasted when you have multiple merge passes, because without the buffer you need to read the last partial page back into memory when you start to output the next run on the tape. But even in that case, I believe that memory would be better used for the quicksort, to create larger runs.

The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).

Yeah, there might be value in having a cap, because operating the merge heap becomes more expensive as it grows, mainly because of cache misses. Perhaps we should have a limit on the size of the merge heap, and use the same limit for replacement selection. Although the heap behaves slightly differently during replacement selection: there, new values added to the heap tend to sink to the bottom, while during the merge, new values tend to stay close to the top. The latter pattern incurs fewer cache misses.
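(To make that access-pattern difference concrete, here is a toy binary min-heap of the kind both phases rely on -- a sketch, not the tuplesort.c implementation. In the merge, the consumed root is replaced by the next value from the same sorted tape, which is usually close in value and so sifts down only a few levels; in replacement selection, an incoming value often belongs near the leaves and the sift touches a whole root-to-leaf path:)

```c
#include <assert.h>
#include <stddef.h>

/* Toy binary min-heap, illustrating the sift-down both phases use. */
typedef struct
{
    int    *v;
    size_t  n;
} Heap;

static void
sift_down(Heap *h, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1,
               r = l + 1,
               smallest = i;

        if (l < h->n && h->v[l] < h->v[smallest])
            smallest = l;
        if (r < h->n && h->v[r] < h->v[smallest])
            smallest = r;
        if (smallest == i)
            break;              /* heap property restored */

        int tmp = h->v[i];
        h->v[i] = h->v[smallest];
        h->v[smallest] = tmp;
        i = smallest;           /* each level visited is a potential cache miss */
    }
}

/* Merge step: pop the minimum, push the next value from the winning tape. */
static int
replace_root(Heap *h, int next)
{
    int min = h->v[0];

    h->v[0] = next;
    sift_down(h, 0);
    return min;
}
```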

But that's orthogonal to the memory-wasting aspect of this. Even if we have a cap of 100 tapes, it's good not to spend memory on the tape buffers of those 100 tapes while building the initial runs.

When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.
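(A minimal sketch of that idea, with hypothetical names rather than the actual logtape.c API: when a run ends, flush the partial last block to disk and remember only its location and fill level; before the next run is written to the same tape, read that block back and resume appending:)

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Hypothetical per-tape state: no buffer is retained between runs,
 * only the location and fill level of the last (partial) block.
 */
typedef struct
{
    long last_block;    /* file offset of the last block written */
    int  last_nbytes;   /* how much of that block holds valid data */
} TapePos;

/* Flush the current buffer, padded to a full block, and record where. */
static void
tape_flush(FILE *f, const char *buf, int nbytes, TapePos *pos)
{
    char block[BLCKSZ] = {0};

    memcpy(block, buf, nbytes);
    pos->last_block = ftell(f);
    pos->last_nbytes = nbytes;
    fwrite(block, 1, BLCKSZ, f);
    fflush(f);
}

/* Reload the partial block so the next run can continue filling it. */
static int
tape_reload(FILE *f, char *buf, TapePos *pos)
{
    fseek(f, pos->last_block, SEEK_SET);
    if (fread(buf, 1, BLCKSZ, f) != BLCKSZ)
        return -1;
    fseek(f, pos->last_block, SEEK_SET); /* block will be rewritten in place */
    return pos->last_nbytes;             /* resume appending at this offset */
}
```

The extra cost is one block read per (tape, run) boundary, in exchange for not reserving a buffer per tape for the whole run-building phase.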

Okay. But you haven't actually addressed the problem with non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape).

I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still allocated ahead of time, with size maxTapes? True, it'd be nice to allocate the LogicalTape structs only when needed. But that's peanuts, compared to the tape buffers.

In addition to saving a little bit of memory, I'd like to do this
refactoring because it simplifies the code. It's code that has stayed
largely unchanged for the past 15 years, so I'm not too eager to touch it,
but looking at the changes coming with Peter's parallel tuplesort patch set,
I think this makes sense.

I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.

Ok, good. I think we're in agreement on doing this, then, even if we don't agree on the justification :-).

That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).

Yeah. I'm not sure how partitioning and all that would be done here. But I'm pretty sure this simpler logtape.c code is easier to work with for partitioning, too.

We might want to have each partition on a separate tape, for example, and therefore have many more tapes than we do currently. Not wasting memory on the tape buffers becomes important in that case.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
