On Wed, Jul 03, 2013 at 01:26:19AM -0000, Greg Sabino Mullane wrote:
> 
> David E. Wheeler wrote:
> 
> > What happens if the client encoding is *not* UTF8? 
> 
> If not UTF8, we don't do anything. I think it is sufficient that we simply 
> require people to use UTF8 as their client_encoding if they want DBD::Pg 
> to do the right thing. It's very common, and more importantly, is the only 
> encoding guaranteed to auto convert from any server encoding.

It would be worth mentioning the PGCLIENTENCODING env var in the docs,
and the fact that it can be set to "auto" to "determine the right
encoding from the current locale in the client (LC_CTYPE environment
variable on Unix systems)."
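In DBD::Pg terms that might look like the following sketch (the connect
details are illustrative; the point is that the env var must be set
before libpq opens the connection):

```perl
use strict;
use warnings;
use DBI;

# Let libpq derive the client encoding from the locale
# (LC_CTYPE environment variable on Unix systems):
$ENV{PGCLIENTENCODING} = 'auto';

# ...or pin it explicitly, which is what DBD::Pg's decoding expects:
$ENV{PGCLIENTENCODING} = 'UTF8';

# Must be in the environment before the connection is made:
my $dbh = DBI->connect('dbi:Pg:dbname=test', undef, undef,
                       { RaiseError => 1 });
```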

> > Will it turn on the flag for all data without regard to type?
> 
> Yes.

The doc says "for all strings coming back" which is possibly a little
ambiguous.

(After poking about in the code and libpq docs I'm wondering if
PQfformat() should be used to confirm that a field is "textual"
before applying SvUTF8_on.)
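(Roughly this, in the driver's C code -- res, col and sv stand in for
whatever the real code has in scope; PQfformat() itself is real libpq
and returns 0 for text format, 1 for binary:

```c
/* Only mark the scalar as UTF-8 when libpq delivered the column
 * in text format; binary-format values are raw bytes. */
if (PQfformat(res, col) == 0)
    SvUTF8_on(sv);
```
)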

Ideally the docs would have a section on Unicode that discusses it in
relation to SQL statements, placeholders, attributes (like $sth->{NAME}),
array stringification and error messages. I.e. at least all the places
that have SvUTF8/_on/_off() calls, plus anywhere that character data
gets passed into libpq.


> > This looks like a good compromise to me: setting it to a boolean retains 
> > the previous behavior (more or less, unless setting it to 1 still converts 
> > it for specific types), and the new default is much saner (assuming 
> > that it applies to *all* types).
> 
> ...
> A lot of this is not going to have any perfect answers, especially as far 
> as backwards compatibility goes, and forward compatibility with DBI 
> support. But we need to get moving, and I think this is a pretty good 
> first effort.

I agree, and I'm delighted to see this.


I would urge you to implement good test coverage for unicode support.
We found all sorts of issues while implementing it for DBD::Oracle
a few years ago (including several bugs in Oracle).

There are some good unicode stress tests in DBD::Oracle.
See https://metacpan.org/source/PYTHIAN/DBD-Oracle-1.64/t/nchar_test_lib.pl
as used by
    https://metacpan.org/source/PYTHIAN/DBD-Oracle-1.64/t/22nchar_utf8.t
and https://metacpan.org/source/PYTHIAN/DBD-Oracle-1.64/t/23wide_db_al32utf8.t

A key part of that is the use of DUMP() to verify that the server itself
has the right representation. Otherwise it's possible to have cases
where characters go in and come back as UTF8, so all seems fine, but the
server doesn't interpret the stored value as the same characters.
Something that returns _character_ length would probably suffice.
That's possibly less of an issue for postgres, but I'd recommend it.
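A minimal sketch of that kind of check for postgres, assuming a
reachable test database (the connect string and sample string are
illustrative):

```perl
use strict;
use warnings;
use utf8;
use DBI;

# Hypothetical connection details -- adjust for your setup
my $dbh = DBI->connect('dbi:Pg:dbname=test', undef, undef,
                       { RaiseError => 1 });

my $str = "caf\x{e9}";  # 4 characters; 5 bytes as UTF-8

# Ask the *server* how many characters it sees. If the round-trip
# merely preserved bytes but the server mislabelled the encoding,
# this count will differ from the client-side character count.
my ($len) = $dbh->selectrow_array(
    'SELECT char_length(?::text)', undef, $str);
print $len == 4 ? "server agrees: 4 characters\n"
                : "mismatch: server saw $len\n";
```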

The unicode docs in DBD::Oracle mainly talk about edge cases
https://metacpan.org/module/DBD::Oracle#UNICODE
but there might be some useful notes.

I'd also recommend using the data_string_desc, data_string_diff
and data_diff functions https://metacpan.org/module/DBI#data_string_desc
I wrote them for my own sanity while working on unicode support in
DBD::Oracle and they proved very useful. (It's easy to be fooled
when working with UTF8.)
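For anyone unfamiliar with them, a quick sketch (the strings are
illustrative; imagine $received came back from a SELECT):

```perl
use strict;
use warnings;
use utf8;
use DBI qw(data_string_desc data_string_diff);

my $sent     = "caf\x{e9}";   # what we bound to the placeholder
my $received = $sent;         # what came back from the database

# Describes a scalar's UTF8 flag state plus its character
# and byte counts -- invaluable when the flag and the bytes disagree.
print data_string_desc($sent), "\n";

# Returns an empty string if the two scalars contain the same
# characters, otherwise a description of the first logical difference.
my $diff = data_string_diff($sent, $received);
print $diff ? "$diff\n" : "strings match\n";
```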

Tim.

p.s. How can I subscribe to the commits mailing list?
