On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
>> It's part of the overall idea to make sure files are not inadvertently
>> exchanged between different backups and that a file is not truncated.
>> In the future I'd also like to add a checksum to the TOC so that a
>> backup can be checked for integrity. This will cost performance but
>> with the parallel backup it can be distributed to several processors.
>
> Ok. I'm going to leave out the filesize. I can see some value in that, and
> the CRC, but I don't want to add stuff that's not used at this point.

Okay.

>> The header is there to identify a file; it contains the header that
>> every other pg_dump file contains, including the internal version
>> number and the unique backup id.
>>
>> The tar format doesn't support compression, so going from one to the
>> other would only work for an uncompressed archive, and special care
>> must be taken to get the order of the tar file right.
>
> Hmm, tar format doesn't support compression, but looks like the file format
> issue has been thought of already: there's still code there to add .gz
> suffix for compressed files. How about adopting that convention in the
> directory format too? That would make an uncompressed directory format
> compatible with the tar format.

So what you could do is dump in the tar format, untar it, and restore
in the directory format. I see that this sounds nice, but I am still
not sure why someone would dump to the tar format in the first place.

But you still cannot go back from the directory archive to the tar
archive, because the standard command-line tar will not respect the
order of the objects that pg_restore expects in a tar archive, right?


> That seems pretty attractive anyway, because you can then dump to a
> directory, and manually gzip the data files later.

The command-line gzip will probably add its own header to the file,
which pg_restore would need to strip off...
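For the record, zlib itself can be asked to detect and skip a gzip
header (windowBits = 15 + 32 in inflateInit2), so the stripping would
not have to be hand-rolled. A minimal sketch using plain zlib, not
compress_io.c, that inflates a file carrying either a zlib or a gzip
header:

/*
 * gunzip_any.c -- decompress a file that may be either zlib-wrapped
 * or gzip-wrapped (as produced by command-line gzip).
 *
 * windowBits = 15 + 32 tells zlib to auto-detect the header, so no
 * manual stripping is needed.  Illustrative only, not pg_dump code.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(int argc, char **argv)
{
	FILE	   *in;
	z_stream	zs;
	unsigned char inbuf[8192];
	unsigned char outbuf[8192];
	int			ret = Z_OK;

	if (argc != 2 || (in = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: %s compressed-file\n", argv[0]);
		return 1;
	}

	memset(&zs, 0, sizeof(zs));
	/* 15 + 32: accept either a zlib or a gzip header automatically */
	if (inflateInit2(&zs, 15 + 32) != Z_OK)
		return 1;

	while (ret != Z_STREAM_END)
	{
		zs.avail_in = fread(inbuf, 1, sizeof(inbuf), in);
		if (zs.avail_in == 0)
			break;				/* truncated input */
		zs.next_in = inbuf;

		do
		{
			zs.avail_out = sizeof(outbuf);
			zs.next_out = outbuf;
			ret = inflate(&zs, Z_NO_FLUSH);
			if (ret != Z_OK && ret != Z_STREAM_END)
			{
				fprintf(stderr, "inflate error: %d\n", ret);
				inflateEnd(&zs);
				return 1;
			}
			fwrite(outbuf, 1, sizeof(outbuf) - zs.avail_out, stdout);
		} while (zs.avail_out == 0);
	}

	inflateEnd(&zs);
	fclose(in);
	return (ret == Z_STREAM_END) ? 0 : 1;
}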

This is a valid use case for people who are concerned with a fast
dump: usually they would dump uncompressed and compress the archive
later. However, once we have parallel pg_dump, this advantage
vanishes.


> Now that we have an API for compression in compress_io.c, it probably
> wouldn't be very hard to implement the missing compression support to tar
> format either.

True, but the question of what advantage the tar format still offers
remains :-)
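Just to illustrate the general shape such an abstraction tends to take
(the names below are hypothetical and not the actual compress_io.c
interface): the tar or directory writer only talks to a pair of
callbacks, and either zlib or a plain pass-through can sit behind
them.

/*
 * Hypothetical sketch of a compression abstraction; the real
 * compress_io.c API may look quite different.  The point is only that
 * the archive writer calls cs->writeF() and never needs to know
 * whether zlib is underneath.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct CompressorState CompressorState;

struct CompressorState
{
	/* compress (or just copy) one chunk and hand it to the archive */
	size_t		(*writeF) (CompressorState *cs, const char *buf, size_t len);
	/* flush any buffered compressed data and release resources */
	void		(*endF) (CompressorState *cs);
	void	   *private_data;	/* e.g. a z_stream, or a plain FILE * */
};

/* trivial pass-through backend, standing in for a zlib-based one */
static size_t
plain_write(CompressorState *cs, const char *buf, size_t len)
{
	return fwrite(buf, 1, len, (FILE *) cs->private_data);
}

static void
plain_end(CompressorState *cs)
{
	fflush((FILE *) cs->private_data);
	free(cs);
}

static CompressorState *
AllocatePlainCompressor(FILE *out)
{
	CompressorState *cs = malloc(sizeof(CompressorState));

	if (cs == NULL)
		exit(1);
	cs->writeF = plain_write;
	cs->endF = plain_end;
	cs->private_data = out;
	return cs;
}

int
main(void)
{
	CompressorState *cs = AllocatePlainCompressor(stdout);

	cs->writeF(cs, "hello\n", 6);
	cs->endF(cs);
	return 0;
}

A gzip-backed variant would keep a z_stream in private_data and run
deflate in its writeF, which is roughly what wiring compression into
the tar writer would amount to.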


>> A tar archive has the advantage that you can postprocess the dump data
>> with other tools but for this we could also add an option that gives
>> you only the data part of a dump file (and uncompresses it at the same
>> time if compressed). Once we have that however, the question is what
>> anybody would then still want to use the tar format for...
>
> I don't know how popular it'll be in practice, but it seems very nice to me
> if you can do things like parallel pg_dump in directory format first, and
> then tar it up to a file for archival.

Yes, but then you cannot pg_restore the archive if it was created with
standard tar, right?


Joachim
