On Wed, Apr 24, 2019 at 09:25:12AM -0400, Robert Haas wrote:
On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmh...@gmail.com> wrote:
For this particular use case, wouldn't you want to read the WAL itself
and use that to issue prefetch requests?  Because if you use the
.modblock files, the data file blocks will end up in memory but the
WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a
reply that specifically concerns it.  Apologies if I have missed one.


I don't think prefetching WAL blocks is all that important. The WAL
segment was probably received fairly recently (either from the primary or
from the archive), so it's reasonable to assume it's still in the page
cache. And even if it's not, sequential reads are handled pretty well by
the kernel's readahead, which is itself a form of prefetching.

But even if WAL prefetching were useful in some cases, I think it's a
mostly orthogonal issue - it certainly does not make prefetching of data
pages unnecessary.
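
To be concrete about what prefetching data pages might look like, here is a
minimal standalone sketch using posix_fadvise(POSIX_FADV_WILLNEED) on the
underlying relation file. The prefetch_block() helper, the relation path and
the hard-coded block size are all made up for illustration; a real
implementation would presumably go through the buffer manager
(PrefetchBuffer) rather than raw file descriptors.

/*
 * Illustrative only: ask the kernel to start reading a data block that an
 * upcoming WAL record will touch.
 */
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ 8192                     /* the usual PostgreSQL block size */

static int
prefetch_block(int fd, unsigned int blkno)
{
    off_t       offset = (off_t) blkno * BLCKSZ;

    /* POSIX_FADV_WILLNEED kicks off an asynchronous read into the page cache */
    return posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
}

int
main(void)
{
    /* hypothetical relation segment referenced by not-yet-replayed WAL */
    int         fd = open("base/16384/16385", O_RDONLY);

    if (fd < 0)
        return 1;
    if (prefetch_block(fd, 123) != 0)
        fprintf(stderr, "posix_fadvise failed\n");
    return 0;
}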

Stepping back a bit, I think that the basic issue under discussion
here is how granular you want your .modblock files.  At one extreme,
one can imagine an application that wants to know exactly which blocks
were accessed at exactly which LSNs.  At the other extreme, if you want
to run a daily incremental backup, you just want to know which blocks
have been modified between the start of the previous backup and the
start of the current backup - i.e. sometime in the last ~24 hours.
These are quite different things.  When you only want approximate
information - is there a chance that this block was changed within
this LSN range, or not? - you can sort and deduplicate in advance;
when you want exact information, you cannot do that.  Furthermore, if
you want exact information, you must store an LSN for every record; if
you want approximate information, you emit a file for each LSN range
and consider it sufficient to know that the change happened somewhere
within the range of LSNs encompassed by that file.


Those are the extreme design options, yes. But I think there may be a
reasonable middle ground that would allow using the modblock files for
both use cases.
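
To make the two extremes concrete, hypothetical per-entry layouts might look
something like this (the names and field widths are invented, not a proposal
for the actual format):

/* Hypothetical per-entry layouts, for illustration only. */
#include <stdint.h>
#include <stdio.h>

/* Exact: one entry per block reference, kept in WAL order, with its LSN. */
typedef struct ExactBlockRef
{
    uint64_t    lsn;            /* LSN of the WAL record touching the block */
    uint32_t    spcnode;        /* tablespace OID */
    uint32_t    dbnode;         /* database OID */
    uint32_t    relnode;        /* relation filenode */
    uint32_t    forknum;        /* fork number */
    uint32_t    blkno;          /* block number */
} ExactBlockRef;

/*
 * Approximate: entries are sorted and deduplicated; the covered LSN range
 * is stored once per modblock file, not per entry.
 */
typedef struct ApproxBlockRef
{
    uint32_t    spcnode;
    uint32_t    dbnode;
    uint32_t    relnode;
    uint32_t    forknum;
    uint32_t    blkno;
} ApproxBlockRef;

int
main(void)
{
    printf("exact: %zu bytes/entry, approximate: %zu bytes/entry\n",
           sizeof(ExactBlockRef), sizeof(ApproxBlockRef));
    return 0;
}

The approximate form is what makes sorting and deduplication possible in the
first place - once the per-entry LSN is dropped, two references to the same
block become identical and can be collapsed.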

It's pretty clear in my mind that what I want to do here is provide
approximate information, not exact information.  Being able to sort
and deduplicate in advance seems critical to be able to make something
like this work on high-velocity systems.

Do you have any analysis / data to support that claim? I mean, it's
obvious that sorting and deduplicating the data right away makes
subsequent processing more efficient, but it's not clear to me that not
doing it would make it useless for high-velocity systems.

If you are generating a
terabyte of WAL between incremental backups, and you don't do any
sorting or deduplication prior to the point when you actually try to
generate the modified block map, you are going to need a whole lot of
memory (and CPU time, though that's less critical, I think) to process
all of that data.  If you can read modblock files which are already
sorted and deduplicated, you can generate results incrementally, send
them to the client incrementally, and never need more than some fixed
amount of memory no matter how much data you are processing.


Sure, but that's not what I proposed elsewhere in this thread. My proposal
was to keep the modblock files "raw" for WAL segments that have not been
recycled yet (so roughly the last ~3 checkpoints), and deduplicate them after
that. So the vast majority of the 1TB of WAL will already have deduplicated
data.

Also, maybe we can do partial deduplication, in a way that would be useful
for prefetching. Say we only deduplicate within 1MB windows - that would
work at least for cases that touch the same page frequently (say, by
inserting into the tail of an index).
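
Roughly something like this (just a sketch; the block reference is collapsed
into a single uint64 key and the window/buffer sizes are arbitrary):

/*
 * Sketch of per-window (partial) deduplication: block references are
 * buffered per 1MB of WAL, then sorted and deduplicated before being
 * emitted (here just printed).
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW_BYTES   (1024 * 1024)    /* deduplicate within 1MB of WAL */
#define MAX_REFS       65536            /* refs buffered per window */

typedef uint64_t BlockRef;              /* (rel, fork, block) packed into 64 bits */

static BlockRef window_refs[MAX_REFS];
static int      window_nrefs = 0;
static uint64_t window_start = 0;       /* LSN where the current window began */

static int
cmp_ref(const void *a, const void *b)
{
    BlockRef    x = *(const BlockRef *) a;
    BlockRef    y = *(const BlockRef *) b;

    return (x > y) - (x < y);
}

/* Sort the buffered references, drop duplicates, emit the window. */
static void
flush_window(void)
{
    int         nout = 0;

    qsort(window_refs, window_nrefs, sizeof(BlockRef), cmp_ref);
    for (int i = 0; i < window_nrefs; i++)
    {
        if (i == 0 || window_refs[i] != window_refs[i - 1])
            window_refs[nout++] = window_refs[i];
    }
    printf("window at LSN %llu: %d refs, %d after dedup\n",
           (unsigned long long) window_start, window_nrefs, nout);
    window_nrefs = 0;
}

/* Called for every block reference decoded from WAL; flushes early if full. */
static void
add_ref(uint64_t lsn, BlockRef ref)
{
    if (lsn - window_start >= WINDOW_BYTES || window_nrefs == MAX_REFS)
    {
        flush_window();
        window_start = lsn;
    }
    window_refs[window_nrefs++] = ref;
}

int
main(void)
{
    /* Simulate an insert-heavy workload touching the tail of an index. */
    for (uint64_t lsn = 0; lsn < 4 * WINDOW_BYTES; lsn += 100)
        add_ref(lsn, 1000 + (lsn / 50000));
    flush_window();
    return 0;
}

For the insert-to-index-tail case each window collapses to a handful of
distinct blocks; references that are far apart in the WAL stream are left
untouched.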

While I'm convinced that this particular feature should provide
approximate rather than exact information, the degree of approximation
is up for debate, and maybe it's best to just make that configurable.
Some applications might work best with small modblock files covering
only ~16MB of WAL each, or even less, while others might prefer larger
quanta, say 1GB or even more.  For incremental backup, I believe that
the quanta will depend on the system velocity.  On a system that isn't
very busy, fine-grained modblock files will make incremental backup
more efficient.  If each modblock file covers only 16MB of data, and
the backup manages to start someplace in the middle of that 16MB, then
you'll only be including 16MB or less of unnecessary block references
in the backup so you won't incur much extra work.  On the other hand,
on a busy system, you probably do not want such a small quantum,
because you will then end up with gazillions of modblock files and that
will be hard to manage.  It could also have performance problems,
because merging data from a couple of hundred files is fine, but
merging data from a couple of hundred thousand files is going to be
inefficient.  My experience hacking on and testing tuplesort.c a few
years ago (with valuable tutelage by Peter Geoghegan) showed me that
there is a slow drop-off in efficiency as the merge order increases --
and in this case, at some point you will blow out the size of the OS
file descriptor table and have to start opening and closing files
every time you access a different one, and that will be unpleasant.
Finally, deduplication will tend to be more effective across larger
numbers of block references, at least on some access patterns.


I agree with those observations in general, but I don't think they somehow
prove we have to deduplicate/sort the data.

FWIW, no one really cares about low-velocity systems. While raw modblock
files would not be an issue on them, they are also mostly uninteresting from
the prefetching perspective. It's the high-velocity systems that have lag.
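
That said, Robert is certainly right that merging pre-sorted, deduplicated
files needs only bounded memory. For illustration, a minimal standalone
sketch of that kind of streaming merge; the file format (a sorted sequence
of uint64 block keys in native byte order) is made up, and a real
implementation would use a heap instead of a linear scan over the inputs:

/*
 * Sketch of a bounded-memory merge of pre-sorted, deduplicated modblock
 * files given on the command line.  Memory use is one key per input file,
 * regardless of the total amount of data.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct MergeInput
{
    FILE       *fp;
    uint64_t    key;
    int         eof;
} MergeInput;

static void
advance(MergeInput *in)
{
    if (fread(&in->key, sizeof(uint64_t), 1, in->fp) != 1)
        in->eof = 1;
}

int
main(int argc, char **argv)
{
    int         nin = argc - 1;
    MergeInput *inputs = calloc(nin, sizeof(MergeInput));
    uint64_t    last = 0;
    int         have_last = 0;

    if (nin > 0 && inputs == NULL)
        return 1;

    for (int i = 0; i < nin; i++)
    {
        inputs[i].fp = fopen(argv[i + 1], "rb");
        if (inputs[i].fp == NULL)
            return 1;
        advance(&inputs[i]);
    }

    for (;;)
    {
        int         best = -1;

        /* pick the smallest current key; a real version would use a heap */
        for (int i = 0; i < nin; i++)
        {
            if (!inputs[i].eof && (best < 0 || inputs[i].key < inputs[best].key))
                best = i;
        }
        if (best < 0)
            break;

        /* deduplicate across inputs while streaming the result out */
        if (!have_last || inputs[best].key != last)
        {
            fwrite(&inputs[best].key, sizeof(uint64_t), 1, stdout);
            last = inputs[best].key;
            have_last = 1;
        }
        advance(&inputs[best]);
    }
    return 0;
}

With a couple hundred inputs the linear scan is fine; it's only at hundreds
of thousands of files that merge order and the fd limit become a problem, as
Robert says.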

So all of that is to say that if somebody wants modblock files each of
which covers 1MB of WAL, I think that the same tools I'm proposing to
build here for incremental backup could support that use case with
just a configuration change.  Moreover, the resulting files would
still be usable by the incremental backup engine.  So that's good: the
same system can, at least to some extent, be reused for whatever other
purposes people want to know about modified blocks.

+1 to the configuration-change approach, at least during the development
phase. It'll make testing and benchmarking much more convenient.

On the other hand, the incremental backup engine will likely not cope
smoothly with having hundreds of thousands or millions of modblock files
shoved down its gullet, so if there is a dramatic difference in the
granularity requirements of different consumers, another approach is
likely indicated.  Especially if some consumer wants to see block
references in the exact order in which they appear in WAL, or wants to
know the exact LSN of each reference, it's probably best to go for a
different approach.  For example, pg_waldump could grow a new option
which spits out just the block references and in a format designed to be
easily machine-parseable; or a hypothetical background worker that does
prefetching for recovery could just contain its own copy of the
xlogreader machinery.


Again, I don't think we have to keep the raw modblock files forever. Send
them to the archive, remove/deduplicate/sort them after we recycle the WAL
segment, or something like that. That way the incremental backups don't
need to deal with an excessive number of modblock files.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


