On 22/09/2011 17:36, David E. Wheeler wrote:
On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote:
There is more than one way to encode unicode - not everyone uses UTF-8;
although some encodings don't support all of unicode.
Yeah, maybe it should be utf8_flag instead.
See below.
Unicode is not encoded as UTF-8 in ODBC using the wide APIs.
Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes
it. Using the ANSI APIs data is returned as octets and is whatever it is - it
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe
they are flawed anyway) it may be something else in which case the application
needs to know what it is. In the case of octets which are UTF-8 encoded
DBD::ODBC has no idea that is the case unless you tell it and it will then set
the UTF-8 flag (but see later).
Right. There needs to be a way to tell the DBI what encoding the server sends
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of
useless.
I think this was my point above, i.e., why utf8? Databases accept and
supply a number of encodings, so why have a flag called utf8? Are we
going to have ucs2, utf16 and utf32 flags as well? Surely it makes more
sense to have a flag where you can set the encoding in the same form
Encode uses.
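
To make the "one setting that names the encoding" idea concrete, here is a rough sketch. It is written in Python purely for illustration (the thread concerns Perl's Encode), and resolve_encoding is a hypothetical helper, not part of DBI or any DBD. It only shows that a single name-based setting can cover UTF-8, UTF-16 and the rest, and can reject names it does not recognise.

```python
import codecs

def resolve_encoding(name):
    """Hypothetical helper: map a user-supplied encoding name to its
    canonical codec name, or return None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

# One setting covers every encoding; no per-encoding flags are needed.
for requested in ("UTF8", "utf-16-le", "latin1", "no-such-encoding"):
    print(requested, "->", resolve_encoding(requested))
```

A driver given such a setting could fail early on an unknown name instead of silently mishandling the data.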
I'm not that familiar with Postgres (I've used it a few times and not to any great
degree) and I used MySQL for a while years ago. I occasionally use SQLite. I do
use DBD::Oracle and DBD::ODBC all the time. I'm still struggling to see the
problem that needs fixing. Is it just that some people would like a DBI flag
which tells the DBD:
1) decode any data coming back from the database strictly such that if it is
invalid you die
2) decode any data coming back from the database loosely (e.g., utf-8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply sets
the UTF-8 flag (which from what I read is horribly flawed but seems to work for
me).
and the reverse.
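
Options 1 to 3 above can be sketched roughly as follows. This is in Python purely for illustration; decode_from_db and its mode names are hypothetical, not an existing DBI or DBD interface. Python's errors="replace" is only a stand-in for "loose": Perl's lax utf8 encoding is lenient in a different way (it accepts code points that strict UTF-8 rejects). Option 4, setting the flag without decoding, has no direct Python analogue and is not shown.

```python
def decode_from_db(octets, encoding="utf-8", mode="strict"):
    """Hypothetical helper: apply one of three decode policies to octets."""
    if mode == "strict":
        # Option 1: invalid input raises UnicodeDecodeError (i.e. "die").
        return octets.decode(encoding, errors="strict")
    if mode == "loose":
        # Option 2 (stand-in): invalid bytes become U+FFFD instead of raising.
        return octets.decode(encoding, errors="replace")
    if mode == "none":
        # Option 3: hand the raw octets back untouched.
        return octets
    raise ValueError(f"unknown mode: {mode}")

good = "caf\u00e9".encode("utf-8")  # valid UTF-8 octets
bad = b"caf\xe9"                    # Latin-1 octets, not valid UTF-8
```

With these inputs, strict mode raises on bad, loose mode substitutes a replacement character, and none returns the octets unchanged.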
Yes, with one API for all drivers, if possible, and guidelines for how it
should work (when to encode and decode, what to encode and decode, when to just
flip the utf8 flag on and off, etc.).
ok
DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g.
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it
does 4 on most of the column data IF Oracle is sending UTF-8.
Yeah, but to enable it *you set a bloody environment variable*. WHAT?
Unless I'm mistaken as to what you refer to I believe that is a feature
of the Oracle client libraries and not one of DBD::Oracle so there is
little we can do about that.
My point being, doesn't the DBD know how the data is encoded when it gets it
from the database? and it would hopefully know what the database needs when
sending data. Perhaps in some conditions the DBD does not know this and needs
to be told (I could imagine SQLite reading/writing straight to files for
instance might want to know to open the file with UTF-8 layer).
Or to turn it off, so you can just pass the encoded UTF-8 through to the file
without the decode/encode round-trip.
So is the problem that sometimes a DBD does not know how to encode data being
sent to the database, or how/whether to decode data coming back from the
database? And if that is the case, do we need some settings in DBI to tell a DBD?
That's an issue, yes, but the main issue is that all the drivers do it
differently, sometimes with different semantics, and lack all the functionality
one might want (e.g., your examples 1-4).
Best,
David
So to try and move forward, we'd be talking about a flag or flags which say:
1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could
be nothing but if not nothing it could be using strict or loose for the
UTF-8 and utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that
internally and a number of databases return UTF-8); this is the
one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?
And what about when it conflicts with your locale/LANG?
And what about PERL_UNICODE flags, do they come into this?
And what about when the DBD knows you are wrong because the database
says it is returning data in encoding X but you ask for Y?
and for DBD::ODBC built for unicode API am I expected to try and decode
UCS2 as x just because the flag tells me to and I know it will not work?
Seems like it only applies to the ANSI API in DBD::ODBC where the data
could be UTF-8 encoded in a few cases (possibly broken, see
http://www.martin-evans.me.uk/node/20#unicode).
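
A small illustration of why a user-supplied encoding cannot sensibly override what the driver already knows. It is in Python purely for illustration; try_decode is a hypothetical helper, and UTF-16LE stands in for the UCS2 data the wide APIs return (the two coincide for characters in the Basic Multilingual Plane).

```python
def try_decode(octets, encoding):
    """Hypothetical helper: return (ok, text) instead of raising."""
    try:
        return True, octets.decode(encoding)
    except UnicodeDecodeError:
        return False, None

# UTF-16LE stands in for UCS2 data from the wide ODBC APIs.
wide_accented = "na\u00efve".encode("utf-16-le")
wide_ascii = "abc".encode("utf-16-le")

print(try_decode(wide_accented, "utf-16-le"))  # correct encoding
print(try_decode(wide_accented, "utf-8"))      # wrong encoding, fails
print(try_decode(wide_ascii, "utf-8"))         # wrong encoding, "succeeds"
```

The last case is the worrying one: the decode does not fail, but the result has a NUL after every character, so the wrong setting goes undetected.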
I still think it would help to name some specific cases per DBD of flags
in use and why they exist:
DBD::ODBC has an odbc_utf8_on flag to say that data returned by the
database when using the ANSI APIs is UTF-8 encoded and currently it
calls SvUTF8_on on that data (I've never used or verified it works
myself but the person supplying the patch said it had a purpose with a
particular Postgres based database). Beyond that DBD::ODBC has no other
flags as it knows in the unicode/wide APIs the data is UCS2 encoded and
it checks it is valid when decoding it. Similarly when sending data to
the database in the wide APIs it takes the Perl scalar and encodes it in
UCS2.
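
The wide-API round trip described above looks roughly like this. It is in Python purely for illustration and is not DBD::ODBC's actual code; to_wide and from_wide are hypothetical names, and UTF-16LE is used as a stand-in for UCS2.

```python
def to_wide(text):
    """Encode text as UTF-16LE (stand-in for UCS2 sent via the wide APIs)."""
    return text.encode("utf-16-le")

def from_wide(octets):
    """Decode strictly; malformed input raises UnicodeDecodeError."""
    return octets.decode("utf-16-le", errors="strict")

euro = to_wide("\u20ac")            # EURO SIGN, two octets on the wire
assert from_wide(euro) == "\u20ac"  # the round trip is lossless

# A high surrogate (0xD800) followed by 'A' is malformed UTF-16,
# so a validating decode rejects it instead of passing it through.
malformed = b"\x00\xd8\x41\x00"
```

Because the encoding is fixed by the API, there is nothing for a user flag to choose here; the only question is whether invalid data is rejected.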
DBD::Oracle to my knowledge has no special flags; it just attempts to do
the right thing but it favours speed so most data that is supposed to be
UTF-8 encoded has SvUTF8_on set but in one case (error messages) it
properly and strictly decodes the message so long as your Perl is recent
enough else it uses SvUTF8_on.
So, what are the other flags in use and what purpose do they fulfil?
Martin