Re: [HACKERS] Logical tape pause/resume

Heikki Linnakangas Tue, 04 Oct 2016 04:48:07 -0700

On 10/04/2016 02:09 PM, Simon Riggs wrote:

On 4 October 2016 at 11:47, Heikki Linnakangas <hlinn...@iki.fi> wrote:

When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.

Reloading the last block, requires an extra I/O. That's OK. It's quite rare
to have a multi-pass sort these days, so it's a good bet that you don't need
to do it. And if you have a very small work_mem, so that you need to do a
multi-pass sort, having some more memory available for building the initial
runs is probably worth it, and there's also a good chance that the block is
still in the OS cache.


Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.

Hmm. You'd still have to read back the last block, so that you canupdate its next-pointer.

What you could do, though, is to always store only one run on each tape.If we can make the in-memory representation of a tape small enough, thatmight be acceptable. You only need 4 bytes to store the starting blocknumber. Or we could dump the starting block numbers of the tapes, to yetanother tape. The number of tapes you merge in one pass would still belimited by the amount of memory you have, but the number of tapes wouldbe unlimited.

I don't want to do that right now, it gets more invasive. Something tokeep in mind for the future, though. This also related to your othersuggestion below:

That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.

This gets more advanced... Yeah, stuff like that would make sense. Youcould also split the tapes into smaller parts, and store the min/max foreach part, to make the merging even more efficient and easier toparallelize. Or to take that to the extreme, you could make each tape alittle B-tree.

But that's for another day and another patch :-). What I have now is adrop-in replacement for the current logical tape implementation. Thesemore advanced schemes would build on top of that.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logical tape pause/resume

Reply via email to