On 10/04/2016 02:09 PM, Simon Riggs wrote:
On 4 October 2016 at 11:47, Heikki Linnakangas <hlinn...@iki.fi> wrote:

When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.

Reloading the last block, requires an extra I/O. That's OK. It's quite rare
to have a multi-pass sort these days, so it's a good bet that you don't need
to do it. And if you have a very small work_mem, so that you need to do a
multi-pass sort, having some more memory available for building the initial
runs is probably worth it, and there's also a good chance that the block is
still in the OS cache.

Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last

Hmm. You'd still have to read back the last block, so that you can update its next-pointer.

What you could do, though, is to always store only one run on each tape. If we can make the in-memory representation of a tape small enough, that might be acceptable. You only need 4 bytes to store the starting block number. Or we could dump the starting block numbers of the tapes, to yet another tape. The number of tapes you merge in one pass would still be limited by the amount of memory you have, but the number of tapes would be unlimited.

I don't want to do that right now, it gets more invasive. Something to keep in mind for the future, though. This also related to your other suggestion below:

That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges

This gets more advanced... Yeah, stuff like that would make sense. You could also split the tapes into smaller parts, and store the min/max for each part, to make the merging even more efficient and easier to parallelize. Or to take that to the extreme, you could make each tape a little B-tree.

But that's for another day and another patch :-). What I have now is a drop-in replacement for the current logical tape implementation. These more advanced schemes would build on top of that.

- Heikki

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to