On Tue, Jan 14, 2014 at 10:12 PM, Magnus Hagander <mag...@hagander.net> wrote:
> On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> As of today, replication protocol has a command called BASE_BACKUP to
>> allow a client connecting with the replication protocol to retrieve a
>> full backup from server through a connection stream. The description
>> of its current options are here:
>> http://www.postgresql.org/docs/9.3/static/protocol-replication.html
>>
>> This command is in charge to put the server in start backup by using
>> do_pg_start_backup, then it sends the backup, and finalizes the backup
>> with do_pg_stop_backup. Thanks to that it is as well possible to get
>> backups from even standby nodes as the stream contains the
>> backup_label file necessary for recovery. Full backup is sent in tar
>> format for obvious performance reasons to limit the amount of data
>> sent through the stream, and server contains necessary coding to send
>> the data in correct format. This forces the client as well to perform
>> some decoding if the output of the base backup received needs to be
>> analyzed on the fly but doing something similar to what now
>> pg_basebackup does when the backup format is plain.
>>
>> I would like to propose the following things to extend BASE_BACKUP to
>> retrieve a backup from a stream:
>> - Addition of an option FORMAT, to control the output format of
>> backup, with possible options as 'plain' and 'tar'. Default is tar for
>> backward compatibility purposes. The purpose of this option is to make
>> easier for backup tools playing with postgres to retrieve and backup
>> and analyze it on the fly, the purpose being to filter and analyze the
>> data while it is being received without all the tar decoding
>> necessary, what would consist in copying portions of pg_basebackup
>> code more or less.
>
>
> How would this be different/better than the tar format?
> pg_basebackup already does this analysis, for example, when it comes to
> recovery.conf. The tar format is really easy to analyze as a stream, that's
> one of the reasons we picked it...
>
>
>>
>> - Addition of an option called INCREMENTAL to send an incremental
>> backup to the client. This option uses as input an LSN, and sends back
>> to client relation pages (in the shape of reduced relation files) that
>> are newer than the LSN specified by looking at pd_lsn of
>> PageHeaderData. In this case the LSN needs to be determined by client
>> based on the latest full backup taken. This option is particularly
>> interesting to reduce the amount of data taken between two backups,
>> even if it increases the restore time as client needs to reconstitute
>> a base backup depending on the recovery target and the pages modified.
>> Client would be in charge of rebuilding pages from incremental backup
>> by scanning all the blocks that need to be updated based on the full
>> backup as the LSN from which incremental backup is taken is known. But
>> this is not really something the server cares about... Such things are
>> actually done by pg_rman as well.
>
>
> This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the
> same underlying technology, depending only on if you go back to the full
> base backup, or to the last incremental one.

Yes, that's actually an LSN-differential. I have had my head in pg_rman for a
couple of weeks, where a similar idea is called incremental.
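
To make the idea a bit more concrete, here is a rough Python sketch of the
LSN filter and of the client-side rebuild. It is only an illustration under
stated assumptions, not the proposed server code: the function names
(parse_lsn, page_lsn, changed_blocks, apply_blocks) are made up for this
example, the block size is assumed to be the default 8192 bytes, and pd_lsn
is assumed to be stored as two little-endian 32-bit words (xlogid, xrecoff)
at the start of each page.

```python
import struct

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def parse_lsn(text):
    """Convert an LSN in 'X/Y' hexadecimal text form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page):
    """Read pd_lsn from the first 8 bytes of a page.

    Assumption: little-endian layout (xlogid, then xrecoff); a real tool
    would have to follow the byte order of the server.
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(data, since_lsn, blcksz=BLCKSZ):
    """Yield (block_number, page) for pages whose pd_lsn is newer than since_lsn."""
    for offset in range(0, len(data) - blcksz + 1, blcksz):
        page = data[offset:offset + blcksz]
        if page_lsn(page) > since_lsn:
            yield offset // blcksz, page


def apply_blocks(base, blocks, blcksz=BLCKSZ):
    """Client side: overwrite pages of the full-backup copy with newer ones."""
    out = bytearray(base)
    for blkno, page in blocks:
        end = (blkno + 1) * blcksz
        if len(out) < end:
            # The relation grew since the full backup: pad with zeroed pages.
            out.extend(b"\0" * (end - len(out)))
        out[blkno * blcksz:end] = page
    return bytes(out)
```

A client would call changed_blocks() with the LSN recorded for its latest
full backup and later feed the result to apply_blocks(). The sketch ignores
several things a real implementation would need to handle, for example
relation truncation and pages that are still all zeroes (their pd_lsn is 0,
so this filter would skip them).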

> But if you look at the terms otherwise, I think incremental often refers to
> what we call WAL.
>
> Either way - if we can do this in a safe way, it sounds like a good idea. It
> would be sort of like rsync, except relying on the fact that we can look at
> the LSN and don't have to compare the actual files, right?

Yep, that's the idea.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers