Re: Add Unicode Support to the DBI

Martin J. Evans Thu, 22 Sep 2011 02:26:36 -0700

On 21/09/11 21:52, Greg Sabino Mullane wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

...

And maybe that's the default. But I should be able to tell it to be pedantic
when the data is known to be bad (see, for example, data from an
SQL_ASCII-encoded PostgreSQL database).

...

DBD::Pg's approach is currently broken. Greg is working on fixing it, but for
compatibility reasons the fix is non-trivial (and the API might be, too). In a
perfect world DBD::Pg would just always do the right thing, as the database
tells it what encodings to use when you connect (and *all* data is encoded as
such, not just certain data types). But the world is not perfect; there's a
lot of legacy stuff.

Greg, care to add any other details?


My thinking on this has changed a bit. See the DBD::Pg in git head for a
sample, but basically, DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not
  SQL_ASCII)
* Flip it off if not

The single switch will be pg_unicode_flag, which will basically override the
automatic choice above, just in case you really want your SQL_ASCII byte soup
marked as utf8 for some reason, or (more likely) you want your data unmarked
as utf8 despite being so.
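
A minimal sketch of the decision described above, for illustration only. The
helper name want_utf8_flag and its arguments are hypothetical, and this is not
DBD::Pg's actual code; $override merely stands in for pg_unicode_flag, with
undef meaning "let the driver choose".

```perl
use strict;
use warnings;

# Hypothetical helper: should returned strings get the utf8 flag?
# $override models pg_unicode_flag; undef means "choose automatically".
sub want_utf8_flag {
    my ($client_encoding, $server_encoding, $override) = @_;

    # An explicit setting always wins over the automatic choice.
    return $override ? 1 : 0 if defined $override;

    # Automatic choice: never mark SQL_ASCII "byte soup" as utf8.
    return 0 if uc($server_encoding) eq 'SQL_ASCII';

    # Otherwise the flag is on only when the client encoding is UTF-8
    # (accepting both the "UTF8" and "UTF-8" spellings).
    return uc($client_encoding) =~ /\AUTF-?8\z/ ? 1 : 0;
}
```

With no override, a UTF8 client on a LATIN1 server gets the flag and any client
on an SQL_ASCII server does not; passing 1 or 0 as the third argument forces
the result either way.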

This does rely on PostgreSQL doing the right thing when it comes to
encoding/decoding/storing all the encodings, but I'm pretty sure it's doing
well in that regard.

...

Since nobody has actually defined a specific interface yet, let me throw out a
straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and
the utf8 flag will be set. DBDs will decode the data as needed.


There is more than one way to encode Unicode - not everyone uses UTF-8,
although some encodings don't support all of Unicode.

Unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS-2 encoding and DBD::ODBC decodes
it. Using the ANSI APIs, data is returned as octets and is whatever it is: it
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know of, and I
believe they are flawed anyway), or it may be something else, in which case
the application needs to know what it is. In the case of octets which are
UTF-8 encoded, DBD::ODBC has no idea that is the case unless you tell it, and
it will then set the UTF-8 flag (but see later).
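
To illustrate the difference between the two paths, here is a small sketch
using the core Encode module. The byte strings are made-up sample data, and
little-endian byte order for the wide data is an assumption; this is not
DBD::ODBC code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Wide-API style data: 16-bit code units (assumed little-endian here)
# spelling "caf" followed by U+00E9.
my $wide_octets = "\x63\x00\x61\x00\x66\x00\xe9\x00";
my $from_wide   = decode('UCS-2LE', $wide_octets);

# ANSI-API style data: just octets. Nothing in the bytes says they are
# UTF-8; the caller has to know that and ask for the decode.
my $ansi_octets = "caf\xc3\xa9";
my $from_ansi   = decode('UTF-8', $ansi_octets);

# Undecoded, the ANSI data is 5 octets; decoded, both paths yield the
# same 4-character string.
printf "%d octets, %d chars, same=%s\n",
    length($ansi_octets), length($from_ansi),
    ($from_wide eq $from_ansi ? 'yes' : 'no');
```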

If this is set off, the utf8 flag will never be set, and no decoding will be
done on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the
correct thing. In other words, the behaviour is undefined.
===
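
The three states of the straw man can be modelled roughly as follows. This is
only a sketch: unicode_flag is the proposal above, not an existing DBI
attribute, and fetch_value and $driver_default are hypothetical names invented
for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical model of the proposed tri-state flag.
# $driver_default is a code ref standing in for whatever the DBD
# would do when left to its own devices.
sub fetch_value {
    my ($flag, $octets, $driver_default) = @_;

    # Undefined: the underlying DBD is responsible for the behaviour.
    return $driver_default->($octets) unless defined $flag;

    # On: treat the data as UTF-8 and decode it.
    # Off: return the octets untouched, never setting the utf8 flag.
    return $flag ? decode('UTF-8', $octets) : $octets;
}
```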

I don't think this will fit into DBD::Pg's current implementation perfectly, as
we wouldn't want people to simply leave $h->{unicode_flag} on, as that would
force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever
allow that.


I'm not that familiar with Postgres (I've used it a few times and not to any
great degree) and I used MySQL for a while years ago. I occasionally use
SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling
to see the problem that needs fixing. Is it just that some people would like a
DBI flag which tells the DBD:

1) decode any data coming back from the database strictly such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply
sets the UTF-8 flag (which from what I read is horribly flawed but seems to
work for me).

and the reverse.
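
In Perl terms, the four options above for data coming back from the database
might look roughly like this sketch using the core Encode module. The policy
names and the apply_policy helper are hypothetical and are not DBI or DBD
attributes.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper mapping the four options onto Encode calls.
sub apply_policy {
    my ($policy, $octets) = @_;

    if ($policy eq 'strict') {
        # 1) strict "UTF-8": dies if the octets are malformed
        return decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC);
    }
    if ($policy eq 'loose') {
        # 2) Perl's lax "utf8" encoding, with no error checking requested
        return decode('utf8', $octets);
    }
    if ($policy eq 'raw') {
        # 3) no decoding: the caller gets the octets as they arrived
        return $octets;
    }
    if ($policy eq 'flag') {
        # 4) no decoding or validation: just mark the buffer as utf8
        my $copy = $octets;
        Encode::_utf8_on($copy);
        return $copy;
    }
    die "unknown policy '$policy'";
}
```

The difference between 1 and 4 shows up with bad input: the strict path dies
on an octet string such as "caf\xe9", whereas the flag path would mark it as
utf8 without ever looking at it.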

DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned on 
in which case it does 4 (and it only does this because there is apparently a 
version of the Postgres ODBC driver out there somewhere that returns UTF-8 
encoded data, I've never seen it, I just accepted the patch).

DBD::ODBC does 1 if using the wide APIs and it has little choice since no one 
would want to accept UCS2 and have to decode it all the time.

My point being, doesn't the DBD know how the data is encoded when it gets it
from the database? It would hopefully also know what the database needs when
sending data. Perhaps in some conditions the DBD does not know this and needs
to be told (I could imagine SQLite, reading/writing straight to files, might
want to know to open the file with a UTF-8 layer).

So is the problem that sometimes a DBD does not know how to encode data being
sent to the database, or how/whether to decode data coming back from the
database? And if that is the case, do we need some settings in DBI to tell a
DBD?

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com
