On 10/09/2011 03:52, David E. Wheeler wrote:
DBIers,
tl;dr: I think it's time to add proper Unicode support to the DBI. What do you
think it should look like?
I'm not sure any change is required to the DBI to support Unicode. As
far as I'm aware, Unicode already works with the DBI if the DBDs do the
right thing. If you stick to the rule that all data Perl receives must
be decoded and all data Perl exports must be encoded, it works (ignoring
any issues in Perl itself).
Background
I've brought this up a time or two in the past, but a number of things have
happened lately to make me think that it was again time:
First, on the DBD::Pg list, we've been having a discussion about improving the
DBD::Pg encoding interface.
http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html
That design discussion followed on the extended discussion in this bug report:
https://rt.cpan.org/Ticket/Display.html?id=40199
Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is
rather broken in a few ways. Notably, PostgreSQL sends *all* data back to
clients in a single encoding -- even binary data (which is usually
hex-encoded). So it made no sense to only decode certain columns. How to go
about fixing it, though, and adding a useful interface, has proven a bit tricky.
Then there was Tom Christiansen's StackOverflow comment:
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
I bow to Tom's experience, but I'm still not sure how that applies to
the DBI. So long as the interface between the database and Perl always
encodes and decodes, the issues Tom describes are all Perl ones - no?
This made me realize that Unicode handling is much trickier than I ever
realized. But it also emphasized for me how important it is to do everything one
can to do Unicode right. Tom followed up with a *lot* more detail in three
OSCON presentations this year, all of which you can read here:
http://98.245.80.27/tcpc/OSCON2011/index.html
(You're likely gonna want to install the fonts linked at the bottom of that
page before you read the presentations in HTML).
And finally, I ran into an issue recently with Oracle, where we have an Oracle
database that should have only UTF-8 data but some row values are actually in
other encodings. This was a problem because I told DBD::Oracle that the
encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I
got broken data back from the database and then my app crashed when I tried to
act on a string with the utf8 flag on but containing invalid UTF-8 bytes. I
reported this issue in a DBD::Oracle bug report:
https://rt.cpan.org/Public/Bug/Display.html?id=70819
Surely Oracle should return the data encoded as you asked for it, and if
it does not, Oracle is broken.
I'd still like to see this case so we can tell whether Oracle is broken
and whether there is a fix for it.
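For what it's worth, the failure mode David describes is easy to
reproduce without a database at all (a sketch - the bytes here stand in
for the mis-encoded row values):

  use strict;
  use warnings;
  use Encode ();

  # Latin-1 octets for "cafe" + e-acute, i.e. data that is NOT UTF-8.
  my $octets = "caf\xe9";

  # Blindly flipping the utf8 flag, as the DBDs do in places, makes
  # Perl believe the buffer is utf8 when it is not.
  Encode::_utf8_on($octets);

  # String operations that walk the characters can now emit "Malformed
  # UTF-8 character" warnings or otherwise misbehave.
  my $upper = uc $octets;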
In some places DBD::Oracle does sv_utf8_decode(scalar) or
SvUTF8_on(scalar) (depending on your Perl), and in some places it just
does SvUTF8_on(scalar). I believe the latter is much quicker as the data
is not checked. Many people (myself included) are particularly
interested in DBD::Oracle being fast, and if all the occurrences were
changed to decode I'd patch that out in my copy, as I know the data I
receive is UTF-8 encoded.
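The same trade-off can be seen from pure Perl with Encode, where
decode() validates the octets and Encode::_utf8_on() just flips the flag
- a sketch only, since the XS code works on the scalar directly:

  use strict;
  use warnings;
  use Encode qw(decode FB_CROAK);

  my $octets = "caf\xc3\xa9";    # valid UTF-8 octets

  # Safe but slower: the octets are checked and the call dies on
  # malformed UTF-8 (FB_CROAK consumes its source, so use a copy).
  my $copy    = $octets;
  my $checked = decode('UTF-8', $copy, FB_CROAK);

  # Fast but unchecked: flip the utf8 flag and trust the source - the
  # pure-Perl equivalent in spirit of SvUTF8_on().
  my $trusted = $octets;
  Encode::_utf8_on($trusted);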
But all this together leads me to believe that it's time to examine adding
explicit Unicode support to the DBI. But it needs to be designed as carefully
as possible to account for a few key points:
* The API must be as straightforward as possible without sacrificing
necessary flexibility. I think it should mostly stay out of users' way
and have reasonable defaults. But it should be clear what each knob we
offer does and how it affects things. Side-effects should be avoided.
* Ability to enforce the correctness of encoding and decoding must be
given priority. Perl has pretty specific ideas about what is and is not
Unicode, so we should respect that as much as possible. If that means
encoding and decoding rather than just flipping the utf8 bit, then fine.
See above. I'd like the chance to go with speed and take the
consequences, rather than go slower but know that incorrect UTF-8 is
spotted.
* The performance impact must be kept as minimal as possible. So if we can get away with
just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that,
though, since Perl's internal representation, called "utf8", is not the same
thing as UTF-8. But if there's an efficient way to convert between the two, then it
should be adopted. For other encodings, obviously a full encode/decode path must be
followed.
I thought "UTF-8" when used in Perl used the strict definition and
"utf8" used Perl's looser definition - see
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8
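That is exactly what the Encode docs that link points at show - for
example (using a UTF-16 surrogate, which is not a valid Unicode
character):

  use strict;
  use warnings;
  no warnings 'utf8';    # we deliberately build an ill-formed code point
  use Encode qw(encode);

  # Lax "utf8" will happily encode a surrogate; strict "UTF-8" croaks.
  my $lax    = encode('utf8',  "\x{D800}", 1);             # okay
  my $strict = eval { encode('UTF-8', "\x{D800}", 1) };    # croaks
  print defined $strict ? "strict accepted it\n"
                        : "strict rejected it: $@";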
* Drivers must be able to adopt the API in a straightforward way. That is to
say, we need to make sure that the interface covers what most (all?) drivers
need. Some, like DBD::Pg, can specify that only one encoding come back from the
database. Maybe others (DBD::mysql) can have individual columns in different
encodings? It needs to cover that case, too.
* It must be able to give the drivers some flexibility. Where we can't account
for everything that all drivers need forever, we should make it possible for
them to add what they need without changing the overall API or the meaning of
the interfaces provided by the DBI.
I'm not at all clear what such an API should look like. Based on my extensive
experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and
limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to
have at least these knobs:
1. An attribute indicating the database encoding. This is the encoding one
expects all data coming from the database to be in. When this is set, the DBI
or the driver would decode incoming data to Perl's internal format and encode
data sent to the database.
Don't DBDs do this now? I know the encoding of the data I receive in
DBD::ODBC, so I decode it when I get it and encode it when I send it,
and I believe that is what DBD::Oracle does as well. There is one
exception in ODBC for drivers which don't truly abide by the ODBC spec
and send 8-bit data back UTF-8 encoded (see later).
2. A fourth param to bind_param() to indicate the encoding in which to send
column data to the database. Defaults to the database encoding.
3. A new parameter to prepare() to indicate the encodings of specific columns
to be selected.
4. An ENCODING attribute on statement handles that indicates the
encoding of each column.
This is just a preliminary proposal, but covers most of the basics, I
think. (I'm sure I'm suggesting the wrong places for some things; a
rough sketch of these knobs in use follows below.) It does assume that
one wants one's text data to always be decoded into Perl's internal form
and encoded when sent to the database. There ought to be a way for one
also to just continue to get binary data and encode and decode oneself
(e.g. for actual binary columns).
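Something like this, where every name below (db_encoding, the Encodings
prepare attribute, the fourth bind_param() argument, ENCODING) is
invented for illustration and exists in no driver today:

  use strict;
  use warnings;
  use DBI;

  # 1. a handle-wide database encoding (hypothetical attribute)
  my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'password',
                         { db_encoding => 'UTF-8' });

  # 3. per-column encodings named at prepare time (invented syntax)
  my $sth = $dbh->prepare('SELECT name, notes FROM t WHERE city = ?',
                          { Encodings => { notes => 'Latin-1' } });

  # 2. a fourth bind_param() argument naming the parameter's encoding
  $sth->bind_param(1, "M\x{fc}nchen", undef, 'UTF-8');
  $sth->execute;

  # 4. a statement attribute reporting each column's encoding
  my $encodings = $sth->{ENCODING};    # e.g. ['UTF-8', 'Latin-1']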
I know that Tim Bunce has thought about this some in the past, and Greg Sabino
Mullane and I have discussed it quite a lot with Dave Rolsky and others. So I
know that folks have some ideas about this stuff. So let's hear them. Let's put
our minds together and try to come up with an interface that we can all work
with.
Thanks,
David
I've spent a lot of effort getting Unicode working in DBD::ODBC (for
UNIX, and with patches from Alexander Foken for Windows), which is
implemented in an awkward fashion in ODBC. I'd like to hear from DBD
authors what support they already have and how it is implemented, so we
can see what ground is already covered and where the problems were, as I
remain unconvinced a problem exists other than incorrectly coded DBDs.
I'm happy to collate that information. As a start, I'll describe
DBD::ODBC:
1. ODBC has 2 sets of APIs: SQLxxxA (each char is 8 bits) and SQLxxxW
(each char is 16 bits and UCS-2). This is how Microsoft did it, and yes,
I know that does not support all of Unicode, but code pages get involved
too.
2. You select which API you are using with a macro when you compile
your application, so you cannot change your mind.
You can in theory call the SQLxxxA or SQLxxxW functions directly, but if
you use SQLxxx you get the A or W variant depending on what the macro is
set to.
Problem: DBD::ODBC has to be built one way or the other.
3. When using the SQLxxxA functions you can still bind
columns/parameters as wide characters but the ODBC driver needs to
support this.
4. When using the SQLxxxW functions all strings are expected to be in
UCS-2. You can bind columns and parameters as whatever type you like,
but obviously if you bind a Unicode column as SQLCHAR instead of
SQLWCHAR you probably get the wrong data back.
5. If you write a Unicode ODBC application using the SQLxxxW functions
and you connect to an ODBC driver which does not support them, the ODBC
driver manager drops the high byte and passes the data to the driver's
SQLxxxA function. Similarly, the other way around: if the app uses the
SQLxxxA functions but the driver supports the SQLxxxW functions, the
ODBC driver manager converts the 8-bit data to UCS-2 and calls the
SQLxxxW functions in the driver. The driver manager does not get
involved in conversions on bound column or parameter data.
Problem: the non-Windows driver managers have implemented Unicode
support in this respect differently, and I do not support them all when
DBD::ODBC is built for Unicode.
unixODBC does what Microsoft does - supported.
http://search.cpan.org/~mjevans/DBD-ODBC-1.31/ODBC.pm#Unicode describes
what unixODBC does and how you can avoid some conversions.
iODBC uses wchar_t types - not supported.
The DataDirect ODBC driver manager uses UTF-8 (which has other problems
in ODBC - see http://www.martin-evans.me.uk/node/20#unicode) and is
unsupported when DBD::ODBC is built for Unicode, but because this driver
manager uses UTF-8 you can use Unicode when DBD::ODBC is not built for
Unicode.
6. Usernames, passwords, column names, table names, SQL, DSNs etc. can
all be Unicode.
Problem: DBI did not always support supplying a scalar to the connect
method - it was a char *. I changed DBI ages ago (1.607) to add another
connect method that passes scalar username, password and DSN strings to
support Unicode in connect parameters.
Fortunately the prepare method already passed a scalar.
I don't think there are problems with the other DBI methods, as I think
they all use scalars and not char *.
7. When DBD::ODBC is built for Unicode (the default on Windows, and an
option on other platforms but not the default there) DBD::ODBC has to
convert all that UCS-2 data to UTF-8 and vice versa.
Problem: this is time consuming, so the code to do it is included in
DBD::ODBC and written in C, using Windows API calls on Windows and code
shipped with DBD::ODBC on other platforms.
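A pure-Perl equivalent of that conversion, shown only to illustrate the
transform (the C code exists precisely because going through Encode like
this is too slow):

  use strict;
  use warnings;
  use Encode qw(encode decode);

  # What a wide ODBC API hands back: UCS-2, little-endian here.
  my $ucs2 = encode('UCS-2LE', "caf\x{e9}");

  # Driver side: UCS-2 octets in, Perl characters out...
  my $chars = decode('UCS-2LE', $ucs2);

  # ...and Perl characters back to UCS-2 octets on the way to ODBC.
  my $back = encode('UCS-2LE', $chars);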
8. DBD::ODBC does not support Unicode parameter names. You cannot do:
select * from table where column = :unicode_param_name
9. You need at least Perl 5.8.1, as Unicode support is fundamentally
broken before then.
10. When using the wide APIs in ODBC, some drivers will report different
column types. For instance, when listing tables, columns etc. some
drivers (e.g., Microsoft SQL Server) will report the column types as
wide types even if the strings actually fit in 7-bit ASCII.
11. Some ODBC drivers do not support the SQLxxxW APIs but can return
data UTF-8 encoded (e.g., PostgreSQL). DBD::ODBC has no idea the data is
UTF-8 encoded, so someone sent me a patch to add the odbc_utf8_on
attribute, which tells DBD::ODBC all data is UTF-8 encoded.
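So with one of those drivers you say something like this (the DSN is
invented; I'm assuming, as with the other odbc_ attributes, that it can
be set at connect time):

  use strict;
  use warnings;
  use DBI;

  # Tell DBD::ODBC this driver returns UTF-8 in its 8-bit buffers.
  my $dbh = DBI->connect('dbi:ODBC:DSN=pg', 'user', 'password',
                         { odbc_utf8_on => 1 });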
12. To signal to Perl that data is UTF-8, DBD::ODBC does:
#ifdef sv_utf8_decode
/* checks the octets look like valid UTF-8 before setting the flag */
sv_utf8_decode(sv);
#else
/* older Perls: turn the utf8 flag on without checking the octets */
SvUTF8_on(sv);
#endif
I believe the former does check the encoding and the latter does not,
but not all Perls have the former.
I use DBD::Oracle and DBD::ODBC a lot, and I know you can send and
receive Unicode data reliably because both those DBDs decode/encode data
from/to the database. In fact, for Oracle we have masses of Unicode data
here in all sorts of languages, including Chinese, and it is working
fine. I guess I just don't see what is broken in the DBI as it stands,
although I'm not ruling out that it might be able to do a little more to
help a DBD.
Martin