Apologies for the delayed response to this.

On Tue, Oct 4, 2016 at 3:47 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] is
> to put a cap on the number of tapes to use for the sort.

> That amounts to about 8% of the available memory. That's quite wasteful.
> Peter's approach of putting an arbitrary cap on the max number of tapes
> works, but I find it a bit hackish. And you still waste that 8% with smaller
> work_mem settings.

I don't think it's hackish -- there was never a theoretical
justification for using as many tapes as possible from Knuth, or
anyone else. I think that Simon's 2006 work on allowing for a number
of tapes much greater than Knuth's "sweet spot" of 7 was useful only
because it sometimes enabled final on-the-fly merging where we are not
merging a huge number of runs (i.e. when merging only somewhat more
than 7 runs). Your recent work on tape preloading has probably greatly
diminished the value of not doing all merging on-the-fly in a single,
final merge, though. Knuth's sweet spot of 7 had little or nothing to
do with the economics of buying many tape drives.

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).

The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.

Okay. But, you haven't actually addressed the problem with non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape). Can't we also fix the general problem, by applying a cap?
Better to have a cap that is approximately right (in the hundreds or
so) than one that is exactly wrong (infinity -- no cap).

> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

That analysis does seem sound to me.

> In addition to saving a little bit of memory, I'd like to do this
> refactoring because it simplifies the code. It's code that has stayed
> largely unchanged for the past 15 years, so I'm not too eager to touch it,
> but looking at the changes coming with Peter's parallel tuplesort patch set,
> I think this makes sense.

I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.

That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).

To recap on what I went into in the partitioning-to-parallel-tuplesort
thread [1], I think that partitioning will come in a future release,
and will be of more value to parallel queries, where much more can be
pushed down within the executor; we really want to remove
co-dependencies across workers early. I'm much less excited about how
much that will benefit parallel CREATE INDEX, though, where
partition-to-sort would need to deal with the complexity of *actually
writing* the finished index in parallel (at least, to see any
benefit), fixing up internal pages, etc. And, any benefit would be
limited, because writing itself is I/O bound -- there is never too
much to push down, unlike with real parallel query. (I feel fairly
confident about this because my CREATE INDEX patch seems to make
Postgres parallel CREATE INDEX have comparable scalability to the
implementations in other major systems.)

Peter Geoghegan

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)