On 21/09/11 21:52, Greg Sabino Mullane wrote:
...
And maybe that's the default. But I should be able to tell it to be pedantic
when the data is known to be bad (see, for example, data from an
SQL_ASCII-encoded PostgreSQL database).
...
DBD::Pg's approach is currently broken. Greg is working on fixing it, but for
compatibility reasons the fix is non-trivial (and the API might be, too). In a
perfect world DBD::Pg would just always do the right thing, as the database
tells it what encodings to use when you connect (and *all* data is encoded as
such, not just certain data types). But the world is not perfect; there's a lot
of legacy stuff.
Greg, care to add any other details?
My thinking on this has changed a bit. See the DBD::Pg in git head for a sample,
but basically, DBD::Pg is going to:
* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not
SQL_ASCII)
* Flip it off if not
The single switch will be pg_unicode_flag, which will basically override the
automatic choice above, just in case you really want your SQL_ASCII byte soup
marked as utf8 for some reason, or (more likely) you want your data unmarked as
utf8 despite being so.
This does rely on PostgreSQL doing the right thing when it comes to
encoding/decoding/storing all the encodings, but I'm pretty sure it's doing well
in that regard.
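
As a rough illustration of the rule described above, here is the decision
written out as a small function. It is sketched in Python purely to pin down
the logic; it is not DBD::Pg code. The function name and its arguments are
hypothetical, and the name normalisation is an assumption added so that both
"UTF8" (PostgreSQL's spelling) and "UTF-8" match.

```python
from typing import Optional


def should_mark_utf8(client_encoding: str,
                     server_encoding: str,
                     pg_unicode_flag: Optional[bool] = None) -> bool:
    """Decide whether returned strings get the utf8 flag (hypothetical helper)."""
    # An explicit pg_unicode_flag overrides the automatic choice either way.
    if pg_unicode_flag is not None:
        return pg_unicode_flag

    def norm(name: str) -> str:
        # Assumption: treat "UTF-8"/"utf8" and "SQL_ASCII"/"sqlascii" alike.
        return name.replace("-", "").replace("_", "").upper()

    # Automatic choice: on only for a UTF-8 client talking to a server whose
    # encoding is not SQL_ASCII; off in every other case.
    return norm(client_encoding) == "UTF8" and norm(server_encoding) != "SQLASCII"
```

With no override, a UTF-8 client on an SQL_ASCII server gets the flag off;
passing pg_unicode_flag=True forces it on regardless.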
...
Since nobody has actually defined a specific interface yet, let me throw out a
straw man. It may look familiar :)
===
* $h->{unicode_flag}
If this is set on, data returned from the database is assumed to be UTF-8, and
the utf8 flag will be set. DBDs will decode the data as needed.
There is more than one way to encode Unicode; not everyone uses UTF-8, although
some encodings don't support all of Unicode.
Unicode is not encoded as UTF-8 in ODBC using the wide APIs.
Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes
it. Using the ANSI APIs, data is returned as octets and is whatever it is: it
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know of, and I believe
they are flawed anyway), or it may be something else, in which case the
application needs to know what it is. In the case of octets which are UTF-8
encoded, DBD::ODBC has no idea that is the case unless you tell it, and it will
then set the UTF-8 flag (but see later).
If this is set off, the utf8 flag will never be set, and no decoding will be
done on data coming back from the database.
If this is not set (undefined), the underlying DBD is responsible for doing the
correct thing. In other words, the behaviour is undefined.
===
I don't think this will fit into DBD::Pg's current implementation perfectly, as
we wouldn't want people to simply leave $h->{unicode_flag} on, as that would
force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever
allow that.
I'm not that familiar with Postgres (I've used it a few times and not to any
great degree) and I used MySQL for a while years ago. I occasionally use SQLite.
I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the
problem that needs fixing. Is it just that some people would like a DBI flag
which tells the DBD:
1) decode any data coming back from the database strictly such that if it is
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply
sets the UTF-8 flag (which from what I read is horribly flawed but seems to
work for me).
and the reverse.
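
To make options 1 to 3 concrete, here is a minimal sketch in Python (used only
to illustrate the semantics, not Perl or DBI code). The function and mode names
are hypothetical. Two assumptions to note: "loose" is modelled here as
"substitute U+FFFD for bad sequences", which is only a stand-in for lax
decoding, and option 4 (set the flag without validating) has no direct Python
counterpart, so it is left out.

```python
def fetch_value(octets: bytes, mode: str):
    """Return a fetched column value according to a decode policy (hypothetical)."""
    if mode == "strict":
        # Option 1: invalid UTF-8 raises UnicodeDecodeError (i.e. "you die").
        return octets.decode("utf-8", errors="strict")
    if mode == "loose":
        # Option 2 (stand-in): tolerate bad input by substituting U+FFFD.
        return octets.decode("utf-8", errors="replace")
    if mode == "raw":
        # Option 3: no decoding at all; the caller gets the octets back.
        return octets
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, the Latin-1 octets b"caf\xe9" fail under "strict", come back with
a replacement character under "loose", and are returned unchanged under "raw".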
DBD::Oracle does 1 some of the time and 4 the rest of the time, e.g. error
messages are fully decoded from UTF-8 IF Oracle is sending UTF-8, and it does 4
on most of the column data IF Oracle is sending UTF-8.
DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned
on, in which case it does 4. It only does this because there is apparently a
version of the Postgres ODBC driver out there somewhere that returns UTF-8
encoded data; I've never seen it, I just accepted the patch.
DBD::ODBC does 1 if using the wide APIs and it has little choice since no one
would want to accept UCS2 and have to decode it all the time.
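
The wide/ANSI difference can be sketched like this, again in Python purely as
an illustration and not as DBD::ODBC code. Both function names are
hypothetical. The sketch assumes little-endian 16-bit code units and uses
Python's "utf-16-le" codec as a stand-in, since Python has no separate UCS2
codec.

```python
from typing import Optional, Union


def from_wide_api(buf: bytes) -> str:
    """Wide API: the encoding is known, so always decode (assumed little-endian)."""
    return buf.decode("utf-16-le")


def from_ansi_api(buf: bytes, encoding: Optional[str] = None) -> Union[bytes, str]:
    """ANSI API: octets are returned as-is unless the caller names an encoding."""
    if encoding is None:
        return buf  # the driver cannot know what these octets mean
    return buf.decode(encoding)
```

The point of the split is that only the wide path can decode without being
told anything; on the ANSI path the application has to supply the encoding.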
My point being, doesn't the DBD know how the data is encoded when it gets it
from the database? It would hopefully also know what the database needs when
sending data. Perhaps in some conditions the DBD does not know this and needs
to be told (I could imagine SQLite, reading/writing straight to files, might
want to know to open the file with a UTF-8 layer, for instance).
So is the problem that sometimes a DBD does not know how to encode data being
sent to the database, or how/whether to decode data coming back from the
database? And if that is the case, do we need some settings in DBI to tell a
DBD?
Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com