[HACKERS] pg_dump/restore encoding woes

Heikki Linnakangas Mon, 26 Aug 2013 08:33:46 -0700

pg_dump and pg_restore don't behave very nicely when the client andserver encodings don't match. Below are three issues that arise fromthat. All the examples below use a console with a UTF-8 locale, and the'latin1db' database uses ISO-8859-1 as the database encoding. In thatdatabase, there is a single table called "pöö".


1. pg_dump verbose output
-------------------------


$ pg_dump -d latin1db -Fc -v -f a.backup

...
pg_dump: finding the columns and types of table "p��"
...

When client encoding is not specified explicitly with the -E option, orPGCLIENTENCODING env variable, the dump is created in the server encoding.

Alexander Law reported this bug about a year ago, see bug #6742:http://www.postgresql.org/message-id/[email protected].

Now, you can say that it's the user's fault for not specifyingclient_encoding correctly, but see #2.


2. pg_dump -t option doesn't work if client_encoding is not set
---------------------------------------------------------------

$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup
pg_dump: No matching tables were found

$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup -E utf-8
(success)

The table name given in the argument is passed to the server withouttranslation, so client_encoding needs to be set or the server will notinterpret the table name correctly.

Like #1, this is a user-error - he needs to set client_encodingcorrectly. Other client programs like vacuumdb have the same problem.But we could do better. psql sets client_encoding automatically(client_encoding='auto') based on the locale. Why don't we do the samein all the client programs?

However, pg_dump is special, because client encoding affects not onlythe encoding used to speak to the server, but it also determines how theresulting dump is encoded. If you have a UTF-8 server, and a LATIN1console, there is no way to get a UTF-8 encoded dump of a single tablewhich has non-ASCII characters in its name. There is a good reason towant to dump in the server encoding regardless of the encoding of theclient: that avoids the costly encoding conversion during the dump, andvery likely another conversion back on restore. (as a convenience, itwould be nice if you could specify "-E server" to mean "same as serverencoding")

The pg_dump -E option just sets client_encoding, but I think it would bebetter for -E to only set the encoding used in the dump, andPGCLIENTENCODING env variable (if set) was used to determine theencoding of the command-line arguments. Opinions?


3. pg_restore -t option doesn't work if dump is in different encoding
---------------------------------------------------------------------

$ pg_dump -d latin1db -Fc -f a.backup
$ ./pg_restore -t pöö a.backup
(restores nothing)

pg_restore doesn't convert encodings when it matches the table namegiven with -t option with the table names in the dump. Hence in aboveexample, where the dump is in LATIN1 encoding and the console uses aUTF-8 locale, the table name is not matched even though there is a tablewith that name in the dump.

Unfortunately I don't see any easy solution to this third issue :-(. Wedon't have any infrastructure to do encoding conversions in the client.I guess we could use iconv(3) if it's available, but I'm a bit reluctantto start using that, given that we've managed to do with out client-sideconversions this far. Or we could do the conversion in the server using"convert_from()", but that only works if pg_restore is connected to adatabase. Perhaps it's best to just throw a warning in if -t is used andthe dump's encoding doesn't match the current locale.


- Heikki


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] pg_dump/restore encoding woes

Reply via email to