On 10/09/2011 03:52, David E. Wheeler wrote:
DBIers,
tl;dr: I think it's time to add proper Unicode support to the DBI. What do you
think it should look like?
I'm not sure any change is required to the DBI to support Unicode. As
far as I'm aware, Unicode already works with the DBI if the DBDs do the
right thing. If you stick to the rule that all data Perl receives must
be decoded and all data Perl exports must be encoded, it works (ignoring
any issues in Perl itself).
Background
I've brought this up a time or two in the past, but a number of things have
happened lately to make me think that it was again time:
First, on the DBD::Pg list, we've been having a discussion about improving the
DBD::Pg encoding interface.
http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html
That design discussion followed on the extended discussion in this bug report:
https://rt.cpan.org/Ticket/Display.html?id=40199
Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is
rather broken in a few ways. Notably, PostgreSQL sends *all* data back to
clients in a single encoding -- even binary data (which is usually
hex-encoded). So it made no sense to only decode certain columns. How to go
about fixing it, though, and adding a useful interface, has proven a bit tricky.
Then there was Tom Christiansen's StackOverflow comment:
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
I bow to Tom's experience, but I'm still not sure how that applies to
the DBI. So long as the interface between the database and Perl always
encodes and decodes, the issues Tom describes are all Perl ones - no?
This made me realize that Unicode handling is much trickier than I ever
realized. But it also emphasized for me how important it is to do everything one
can to do Unicode right. Tom followed up with a *lot* more detail in three
OSCON presentations this year, all of which you can read here:
http://98.245.80.27/tcpc/OSCON2011/index.html
(You're likely gonna want to install the fonts linked at the bottom of that
page before you read the presentations in HTML).
And finally, I ran into an issue recently with Oracle, where we have an Oracle
database that should have only UTF-8 data but some row values are actually in
other encodings. This was a problem because I told DBD::Oracle that the
encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I
got broken data back from the database and then my app crashed when I tried to
act on a string with the utf8 flag on but containing invalid UTF-8 bytes. I
reported this issue in a DBD::Oracle bug report:
https://rt.cpan.org/Public/Bug/Display.html?id=70819
Surely Oracle should return the data encoded as you asked for it, and if
it does not, Oracle is broken.
I'd still like to see this case so we can tell whether Oracle is broken
and whether there is a fix for it.
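For what it's worth, the failure mode David describes is easy to
reproduce without a database at all (a sketch - the bytes here stand in
for the mis-encoded row values):

  use strict;
  use warnings;
  use Encode ();

  # Latin-1 octets for "cafe" + e-acute, i.e. data that is NOT UTF-8.
  my $octets = "caf\xe9";

  # Blindly flipping the utf8 flag, as the DBDs do in places, makes
  # Perl believe the buffer is utf8 when it is not.
  Encode::_utf8_on($octets);

  # String operations that walk the characters can now emit "Malformed
  # UTF-8 character" warnings or otherwise misbehave.
  my $upper = uc $octets;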
In some places DBD::Oracle does sv_utf8_decode(scalar) or
SvUTF8_on(scalar) (depending on your Perl), and in some places it just
does SvUTF8_on(scalar). I believe the latter is much quicker as the data
is not checked. Many people (myself included) are particularly
interested in DBD::Oracle being fast, and if all the occurrences were
changed to decode I'd patch that out in my copy, as I know the data I
receive is UTF-8 encoded.
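The same trade-off can be seen from pure Perl with Encode, where
decode() validates the octets and Encode::_utf8_on() just flips the flag
- a sketch only, since the XS code works on the scalar directly:

  use strict;
  use warnings;
  use Encode qw(decode FB_CROAK);

  my $octets = "caf\xc3\xa9";    # valid UTF-8 octets

  # Safe but slower: the octets are checked and the call dies on
  # malformed UTF-8 (FB_CROAK consumes its source, so use a copy).
  my $copy    = $octets;
  my $checked = decode('UTF-8', $copy, FB_CROAK);

  # Fast but unchecked: flip the utf8 flag and trust the source - the
  # pure-Perl equivalent in spirit of SvUTF8_on().
  my $trusted = $octets;
  Encode::_utf8_on($trusted);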
But all this together leads me to believe that it's time to examine adding
explicit Unicode support to the DBI. But it needs to be designed as carefully
as possible to account for a few key points:
* The API must be as straightforward as possible without sacrificing
necessary flexibility. I think it should mostly stay out of users' way
and have reasonable defaults. But it should be clear what each knob we
offer does and how it affects things. Side-effects should be avoided.
* Ability to enforce the correctness of encoding and decoding must be
given priority. Perl has pretty specific ideas about what is and is not
Unicode, so we should respect that as much as possible. If that means
encoding and decoding rather than just flipping the utf8 bit, then fine.
See above. I'd like the chance to go with speed and take the
consequences, rather than go slower but know that incorrect UTF-8 is
spotted.
* The performance impact must be kept as minimal as possible. So if we can get away with
just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that,
though, since Perl's internal representation, called "utf8", is not the same
thing as UTF-8. But if there's an efficient way to convert between the two, then it
should be adopted. For other encodings, obviously a full encode/decode path must be
followed.
I thought "UTF-8" when used in Perl used the strict definition and
"utf8" used Perl's looser definition - see
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8
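That is exactly what the Encode docs that link points at show - for
example (using a UTF-16 surrogate, which is not a valid Unicode
character):

  use strict;
  use warnings;
  no warnings 'utf8';    # we deliberately build an ill-formed code point
  use Encode qw(encode);

  # Lax "utf8" will happily encode a surrogate; strict "UTF-8" croaks.
  my $lax    = encode('utf8',  "\x{D800}", 1);             # okay
  my $strict = eval { encode('UTF-8', "\x{D800}", 1) };    # croaks
  print defined $strict ? "strict accepted it\n"
                        : "strict rejected it: $@";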
* Drivers must be able to adopt the API in a straightforward way. That is to
say, we need to make sure that the interface covers what most (all?) drivers
need. Some, like DBD::Pg, can specify that only one encoding come back from the
database. Maybe others (DBD::mysql) can have individual columns in different
encodings? It needs to cover that case, too.
* It must be able to give the drivers some flexibility. Where we can't account
for everything that all drivers need forever, we should make it possible for
them to add what they need without changing the overall API or the meaning of
the interfaces provided by the DBI.
I'm not at all clear what such an API should look like. Based on my extensive
experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and
limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to
have at least these knobs:
1. An attribute indicating the database encoding. This is the encoding one
expects all data coming from the database to be in. When this is set, the DBI
or the driver would decode incoming data to Perl's internal format and encode
data sent to the database.
Don't DBDs do this now? I know the encoding of the data I receive in
DBD::ODBC, so I decode it when I get it and encode it when I send it,
and I believe that is what DBD::Oracle does as well. There is one
exception in ODBC for drivers which don't truly abide by the ODBC spec
and send 8-bit data back UTF-8 encoded (see later).
2. A fourth param to bind_param() to indicate the encoding in which to send
column data to the database. Defaults to the database encoding.
3. A new parameter to prepare() to indicate the encodings of specific columns
to be selected.
4. An ENCODING attribute on statement handles that indicates the
encoding of each column.
This is just a preliminary proposal, but covers most of the basics, I
think. (I'm sure I'm suggesting the wrong places for some things; a
rough sketch of these knobs in use follows below.) It does assume that
one wants one's text data to always be decoded into Perl's internal form
and encoded when sent to the database. There ought to be a way for one
also to just continue to get binary data and encode and decode oneself
(e.g. for actual binary columns).
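Something like this, where every name below (db_encoding, the Encodings
prepare attribute, the fourth bind_param() argument, ENCODING) is
invented for illustration and exists in no driver today:

  use strict;
  use warnings;
  use DBI;

  # 1. a handle-wide database encoding (hypothetical attribute)
  my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'password',
                         { db_encoding => 'UTF-8' });

  # 3. per-column encodings named at prepare time (invented syntax)
  my $sth = $dbh->prepare('SELECT name, notes FROM t WHERE city = ?',
                          { Encodings => { notes => 'Latin-1' } });

  # 2. a fourth bind_param() argument naming the parameter's encoding
  $sth->bind_param(1, "M\x{fc}nchen", undef, 'UTF-8');
  $sth->execute;

  # 4. a statement attribute reporting each column's encoding
  my $encodings = $sth->{ENCODING};    # e.g. ['UTF-8', 'Latin-1']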
I know that Tim Bunce has thought about this some in the past, and Greg Sabino
Mullane and I have discussed it quite a lot with Dave Rolsky and others. So I
know that folks have some ideas about this stuff. So let's hear them. Let's put
our minds together and try to come up with an interface that we can all work
with.
Thanks,
David
I've spent a lot of effort getting Unicode working in DBD::ODBC (for
UNIX, and with patches from Alexander Foken for Windows), which is
implemented in an awkward fashion in ODBC. I'd like to hear from DBD
authors what support they already have and how it is implemented, so we
can see what ground is already covered and where the problems were, as I
remain unconvinced a problem exists other than incorrectly coded DBDs.
I'm happy to collate that information. As a start, I'll describe
DBD::ODBC:
1. ODBC has 2 sets of APIs: SQLxxxA (each char is 8 bits) and SQLxxxW
(each char is 16 bits and UCS-2). This is how Microsoft did it, and yes,
I know that does not support all of Unicode, but code pages get involved
too.
2. You select which API you are using with a macro when you compile
your application, so you cannot change your mind.
You can in theory call the SQLxxxA or SQLxxxW functions directly, but if
you use SQLxxx you get the A or W variant depending on what the macro is
set to.
Problem: DBD::ODBC has to be built one way or the other.
3. When using the SQLxxxA functions you can still bind
columns/parameters as wide characters but the ODBC driver needs to
support this.
4. When using the SQLxxxW functions all strings are expected to be in
UCS-2. You can bind columns and parameters as whatever type you like,
but obviously if you bind a Unicode column as SQLCHAR instead of
SQLWCHAR you probably get the wrong data back.
5. If you write a Unicode ODBC application using the SQLxxxW functions
and you connect to an ODBC driver which does not support them, the ODBC
driver manager drops the high byte and passes the data to the driver's
SQLxxxA function. Similarly, the other way around: if the app uses the
SQLxxxA functions but the driver supports the SQLxxxW functions, the
ODBC driver manager converts the 8-bit data to UCS-2 and calls the
SQLxxxW functions in the driver. The driver manager does not get
involved in conversions on bound column or parameter data.
Problem: the non-Windows driver managers have implemented Unicode
support in this respect differently, and I do not support them all when
DBD::ODBC is built for Unicode.
unixODBC does what Microsoft does - supported.
http://search.cpan.org/~mjevans/DBD-ODBC-1.31/ODBC.pm#Unicode describes
what unixODBC does and how you can avoid some conversions.
iODBC uses wchar_t types - not supported.
The DataDirect ODBC driver manager uses UTF-8 (which has other problems
in ODBC - see http://www.martin-evans.me.uk/node/20#unicode) and is
unsupported when DBD::ODBC is built for Unicode, but because this driver
manager uses UTF-8 you can use Unicode when DBD::ODBC is not built for
Unicode.
6. Usernames, passwords, column names, table names, SQL, DSNs etc. can
all be Unicode.
Problem: DBI did not always support supplying a scalar to the connect
method - it was a char *. I changed DBI ages ago (1.607) to add another
connect method that passes scalar username, password and DSN strings to
support Unicode in connect parameters.
Fortunately the prepare method already passed a scalar.
I don't think there are problems with the other DBI methods, as I think
they all use scalars and not char *.
7. When DBD::ODBC is built for Unicode (the default on Windows, and an
option on other platforms but not the default there) DBD::ODBC has to
convert all that UCS-2 data to UTF-8 and vice versa.
Problem: this is time consuming, so the code to do it is included in
DBD::ODBC and written in C, using Windows API calls on Windows and code
shipped with DBD::ODBC on other platforms.
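A pure-Perl equivalent of that conversion, shown only to illustrate the
transform (the C code exists precisely because going through Encode like
this is too slow):

  use strict;
  use warnings;
  use Encode qw(encode decode);

  # What a wide ODBC API hands back: UCS-2, little-endian here.
  my $ucs2 = encode('UCS-2LE', "caf\x{e9}");

  # Driver side: UCS-2 octets in, Perl characters out...
  my $chars = decode('UCS-2LE', $ucs2);

  # ...and Perl characters back to UCS-2 octets on the way to ODBC.
  my $back = encode('UCS-2LE', $chars);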
8. DBD::ODBC does not support Unicode parameter names. You cannot do:
select * from table where column = :unicode_param_name
9. You need at least Perl 5.8.1, as Unicode support is fundamentally
broken before then.
10. When using the wide APIs in ODBC, some drivers will report different
column types. For instance, when listing tables, columns etc. some
drivers (e.g., Microsoft SQL Server) will report the column types as
wide types even if the strings actually fit in 7-bit ASCII.
11. Some ODBC drivers do not support the SQLxxxW APIs but can return
data UTF-8 encoded (e.g., PostgreSQL). DBD::ODBC has no idea the data is
UTF-8 encoded, so someone sent me a patch to add the odbc_utf8_on
attribute, which tells DBD::ODBC all data is UTF-8 encoded.
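So with one of those drivers you say something like this (the DSN is
invented; I'm assuming, as with the other odbc_ attributes, that it can
be set at connect time):

  use strict;
  use warnings;
  use DBI;

  # Tell DBD::ODBC this driver returns UTF-8 in its 8-bit buffers.
  my $dbh = DBI->connect('dbi:ODBC:DSN=pg', 'user', 'password',
                         { odbc_utf8_on => 1 });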
12. To signal to Perl that data is UTF-8, DBD::ODBC does:
#ifdef sv_utf8_decode
/* checks the octets look like valid UTF-8 before setting the flag */
sv_utf8_decode(sv);
#else
/* older Perls: turn the utf8 flag on without checking the octets */
SvUTF8_on(sv);
#endif
I believe the former does check the encoding and the latter does not,
but not all Perls have the former.
I use DBD::Oracle and DBD::ODBC a lot, and I know you can send and
receive Unicode data reliably because both those DBDs decode/encode data
from/to the database. In fact, for Oracle we have masses of Unicode data
here in all sorts of languages, including Chinese, and it is working
fine. I guess I just don't see what is broken in the DBI as it stands,
although I'm not ruling out that it might be able to do a little more to
help a DBD.
Martin