On 22/09/2011 19:28, David E. Wheeler wrote:
On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:
Right. There needs to be a way to tell the DBI what encoding the server sends
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of
useless.
I think this was my point above, i.e., why utf8? Databases accept and supply a
number of encodings, so why have a flag called utf8? Are we going to have ucs2,
utf16 and utf32 flags as well? Surely it makes more sense to have a flag where
you can set the encoding in the same form Encode uses.
Yes, I agreed with you. :-)
progress then ;-)
Unless I'm mistaken as to what you refer to I believe that is a feature of the
Oracle client libraries and not one of DBD::Oracle so there is little we can do
about that.
Sure you can. I set something via the DBI interface and the DBD sets the
environment variable for the Oracle client libraries.
OK, except that the names the Oracle client libraries accept do not match
the encoding names Encode accepts, so someone would have to come up with
some sort of mapping between the two.
e.g., my current setting is:
NLS_LANG=AMERICAN_AMERICA.AL32UTF8
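To illustrate the sort of mapping I mean, here is a minimal sketch. The table and the function name encode_name_for_nls_lang are hypothetical and not part of DBD::Oracle; the table covers only a handful of character sets whose Encode equivalents I am confident about, and a real one would need far more entries.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical, deliberately incomplete mapping from Oracle NLS
# character set names to the names Encode understands.
my %ORACLE_TO_ENCODE = (
    AL32UTF8      => 'UTF-8',
    WE8ISO8859P1  => 'iso-8859-1',
    WE8ISO8859P15 => 'iso-8859-15',
    WE8MSWIN1252  => 'cp1252',
    US7ASCII      => 'ascii',
);

# Take an NLS_LANG value (language_territory.charset) and return the
# matching Encode name, or undef if the charset is not in the table.
sub encode_name_for_nls_lang {
    my ($nls_lang) = @_;
    return unless defined $nls_lang;
    my ($charset) = $nls_lang =~ /\.([A-Za-z0-9]+)\z/ or return;
    my $name = $ORACLE_TO_ENCODE{ uc $charset } or return;
    # Only hand back names that Encode can actually resolve.
    return Encode::find_encoding($name) ? $name : undef;
}

my $name = encode_name_for_nls_lang('AMERICAN_AMERICA.AL32UTF8');
print defined $name ? "$name\n" : "no mapping\n";
```

Going the other way (Encode name to NLS_LANG) would need the reverse table, plus a decision about what to do with the language and territory parts.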
So to try and move forward, we'd be talking about a flag or flags which say:
1 encode the data sent to the database like "this" (which could be nothing)
2 decode the data retrieved from the database like "this" (which could be
nothing, but if not nothing it could be strict or loose for the UTF-8 and
utf-8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that
internally and a number of databases return UTF-8)
one that seems to work but I worry about.
4 do what the DBD thinks is best - whatever the behaviour is now?
Yes.
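For anyone following along, the difference between options 2 and 3 can be shown in pure Perl with the core Encode module. This is only a sketch: Encode::_utf8_on is used here as a stand-in for what SvUTF8_on does at the XS level, and the sample data is made up.

```perl
use strict;
use warnings;
use Encode ();

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $bytes = "caf\xC3\xA9";    # UTF-8 octets for "cafe" with e-acute

# Option 2, strict: "UTF-8" validates the octets and dies if they are malformed.
my $strict = Encode::decode('UTF-8', $bytes, $check);

# Option 2, loose: Perl's lax "utf8" accepts some sequences strict UTF-8 rejects.
my $loose = Encode::decode('utf8', $bytes, $check);

# Option 3: no validation at all, just mark the existing octets as UTF-8.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

# With malformed input the strict decode refuses, whereas simply setting
# the flag would have produced a corrupt Perl string without complaint.
my $bad = "caf\xE9";          # Latin-1 octet, not valid UTF-8
my $decoded_ok = eval { Encode::decode('UTF-8', $bad, $check); 1 } ? 1 : 0;

printf "strict=%d loose=%d flagged=%d bad_decoded=%d\n",
    length $strict, length $loose, length $flagged, $decoded_ok;
```

That is why option 3 is fast but worrying: it trusts whatever the database sent.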
and what about when it conflicts with your locale/LANG?
So what?
I'm not so sure this is a "So what" as Perl itself uses locale settings
in some cases - just thought it needed mentioning for consideration.
and what about PERL_UNICODE flags, do they come into this?
What are those?
See http://perldoc.perl.org/perlrun.html
In particular "UTF-8 is the default PerlIO layer for input streams" of
which reading data from a database could be considered one?
and what about when the DBD knows you are wrong because the database says it is
returning data in encoding X but you ask for Y.
Throw an exception or a warning.
ok
and for DBD::ODBC built for unicode API am I expected to try and decode UCS2 as
x just because the flag tells me to and I know it will not work? Seems like it
only applies to the ANSI API in DBD::ODBC where the data could be UTF-8 encoded
in a few (possibly broken see http://www.martin-evans.me.uk/node/20#unicode)
cases.
If the user does something that makes no sense, tell them it makes no sense.
Die if necessary.
I still think it would help to name some specific cases per DBD of flags in use
and why they exist:
DBD::ODBC has an odbc_utf8_on flag to say that data returned by the database
when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on
that data (I've never used or verified it works myself but the person supplying
the patch said it had a purpose with a particular Postgres based database).
That's what the new DBD::Pg flag that Greg's working on does, too.
Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs
the data is UCS2 encoded and it checks it is valid when decoding it. Similarly
when sending data to the database in the wide APIs it takes the Perl scalar and
encodes it in UCS2.
Yeah, ideally, by default, if the DBD knows the encoding used by the database, it should
just DTRT. There are backward compatibility issues with that for DBD::Pg, though. So
there probably should be a knob to say "don't do any encoding or decoding at
all", because a lot of older apps likely expect that.
DBD::Oracle to my knowledge has no special flags; it just attempts to do the
right thing but it favours speed so most data that is supposed to be UTF-8
encoded has SvUTF8_on set but in one case (error messages) it properly and
strictly decodes the message so long as your Perl is recent enough else it uses
SvUTF8_on.
So, what are the other flags in use and what purpose do they fulfill?
I think we could really just start with one flag, "encoding". By default the DBD should just try to do the
right thing. If "encoding" is set to ":raw" then it should do no encoding or decoding. If it's set
to ":utf8" it should just turn the flag on or off. If it's set to an actual encoding it should encode and
decode. I think that would be a good start.
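As a rough sketch of how the fetch side of such a flag might behave: the function name decode_column and the $DRIVER_DEFAULT placeholder below are hypothetical and not part of DBI or any DBD, and Encode::_utf8_on stands in for SvUTF8_on. The sending side would mirror this with encode.

```perl
use strict;
use warnings;
use Encode ();

# Placeholder for "whatever the DBD thinks is best"; a real driver would
# work this out from the connection rather than use a fixed value.
our $DRIVER_DEFAULT = ':raw';

# Hypothetical helper: apply an "encoding" setting to one fetched value.
sub decode_column {
    my ($encoding, $octets) = @_;
    return $octets unless defined $octets;           # NULL stays NULL
    $encoding = $DRIVER_DEFAULT unless defined $encoding;

    return $octets if $encoding eq ':raw';           # no decoding at all

    if ($encoding eq ':utf8') {                      # just turn the flag on
        my $copy = $octets;
        Encode::_utf8_on($copy);
        return $copy;
    }

    # Anything else is treated as an Encode name and decoded strictly.
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    return $enc->decode($octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

printf "%d %d %d\n",
    length decode_column(':raw',       "caf\xC3\xA9"),
    length decode_column(':utf8',      "caf\xC3\xA9"),
    length decode_column('iso-8859-1', "caf\xE9");
```

An unknown encoding name dies here, in line with telling the user when a request makes no sense.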
Best,
David
ok, I'm thinking through the ramifications of this.
To add to the list, I see DBD::SQLite has sqlite_unicode: "strings
coming from the database and passed to the collation function will be
properly tagged with the utf8 flag; but this only works if the
sqlite_unicode attribute is set before the first call to a perl
collation sequence" and "The current FTS3 implementation in SQLite is
far from complete with respect to utf8 handling : in particular,
variable-length characters are not treated correctly by the builtin
functions offsets() and snippet()."
and DBD::CSV has
f_encoding => "utf8",
DBD::mysql has mysql_enable_utf8, documented as: "This attribute
determines whether DBD::mysql should assume strings stored in the
database are utf8. This feature defaults to off."
I could not find any special flags for DBD::DB2.
DBD::Sybase has syb_enable_utf8: "If this attribute is set then
DBD::Sybase will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's
internal utf-8 encoding when they are retrieved. Updating a unicode
column will cause Sybase to convert any incoming data from utf-8 to its
internal utf-16 encoding."
Martin