On Tue, Jan 14, 2014 at 10:12 PM, Magnus Hagander <mag...@hagander.net> wrote:
> On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> As of today, the replication protocol has a command called BASE_BACKUP
>> that allows a client connecting with the replication protocol to
>> retrieve a full backup from the server through a connection stream. The
>> description of its current options is here:
>> http://www.postgresql.org/docs/9.3/static/protocol-replication.html
>>
>> This command puts the server into backup mode using
>> do_pg_start_backup, then sends the backup, and finalizes the backup
>> with do_pg_stop_backup. Thanks to that, it is also possible to take
>> backups even from standby nodes, as the stream contains the
>> backup_label file necessary for recovery. The full backup is sent in
>> tar format, for obvious performance reasons, to limit the amount of
>> data sent through the stream, and the server contains the code needed
>> to send the data in the correct format. This also forces the client to
>> perform some decoding if the received base backup needs to be analyzed
>> on the fly, doing something similar to what pg_basebackup now does
>> when the backup format is plain.
>>
>> I would like to propose the following things to extend BASE_BACKUP to
>> retrieve a backup from a stream:
>> - Addition of an option FORMAT, to control the output format of the
>> backup, with possible values 'plain' and 'tar'. The default is tar for
>> backward compatibility. The purpose of this option is to make it
>> easier for backup tools working with postgres to retrieve a backup
>> and analyze it on the fly, that is, to filter and analyze the data
>> while it is being received without all the tar decoding otherwise
>> necessary, which would more or less consist of copying portions of
>> the pg_basebackup code.
>
>
> How would this be different/better than the tar format? pg_basebackup
> already does this analysis, for example, when it comes to recovery.conf.
> The tar format is really easy to analyze as a stream, that's one of the
> reasons we picked it...
>
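
For illustration only, here is a minimal sketch of reading a tar archive
sequentially from a stream, in Python. It is not taken from pg_basebackup;
the helper name find_member is hypothetical, and the input is assumed to be
a complete, well-formed tar archive exposed as a file-like object.

```python
import tarfile


def find_member(stream, name):
    """Scan a tar archive sequentially and return the bytes of `name`.

    The archive is opened in stream mode ("r|"), so the input is only
    read forward and never seeked. Returns None if `name` is absent.
    """
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            # In stream mode a member must be read while it is current.
            if member.isfile() and member.name == name:
                return tar.extractfile(member).read()
    return None
```

A tool could call this with the name backup_label to pick that file out of
the archive while the rest of the data flows past.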
>
>>
>> - Addition of an option called INCREMENTAL to send an incremental
>> backup to the client. This option uses as input an LSN, and sends back
>> to the client relation pages (in the shape of reduced relation files)
>> that are newer than the specified LSN, by looking at pd_lsn of
>> PageHeaderData. In this case the LSN needs to be determined by the
>> client based on the latest full backup taken. This option is
>> particularly interesting to reduce the amount of data taken between two
>> backups, even if it increases the restore time, as the client needs to
>> reconstitute a base backup depending on the recovery target and the
>> pages modified. The client would be in charge of rebuilding pages from
>> the incremental backup by scanning all the blocks that need to be
>> updated based on the full backup, as the LSN from which the incremental
>> backup is taken is known. But this is not really something the server
>> cares about... Such things are actually done by pg_rman as well.
>
>
> This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the
> same underlying technology, depending only on if you go back to the full
> base backup, or to the last incremental one.
Yes, that's actually an LSN-based differential. I have had my head in
pg_rman for a couple of weeks, where a similar idea is called incremental.
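
To make the page filtering concrete, here is a rough sketch in Python. It is
only an illustration under stated assumptions, not server code: the function
names are hypothetical, the block size is assumed to be the default 8192
bytes, and pd_lsn is assumed to be stored little-endian (on disk it uses the
server's native byte order). Edge cases such as uninitialized pages are
ignored.

```python
import struct

BLCKSZ = 8192  # assumption: default block size (a compile-time option)


def parse_lsn(text):
    """Convert an LSN written as 'X/Y' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page):
    """Read pd_lsn, the first 8 bytes of the page header.

    It is stored as two 32-bit halves (high, low); little-endian is
    assumed here.
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(relfile, since_lsn):
    """Yield (block_number, page) for pages newer than since_lsn."""
    for off in range(0, len(relfile) - BLCKSZ + 1, BLCKSZ):
        page = relfile[off:off + BLCKSZ]
        if page_lsn(page) > since_lsn:
            yield off // BLCKSZ, page
```

The client would keep the block numbers alongside the pages, so that it can
later overwrite the matching blocks in the full backup.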

>
> But if you look at the terms otherwise, I think incremental often refers to
> what we call WAL.
>
> Either way - if we can do this in a safe way, it sounds like a good idea. It
> would be sort of like rsync, except relying on the fact that we can look at
> the LSN and don't have to compare the actual files, right?
Yep, that's the idea.
-- 
Michael

