On 10/09/2011 03:52, David E. Wheeler wrote:
DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you 
think it should look like?
I'm not sure any change is required in the DBI to support unicode. As far as I'm aware, unicode already works with the DBI if the DBDs do the right thing.

If you stick to the rule that all data Perl receives must be decoded and all data Perl exports must be encoded, it works (ignoring any issues in Perl itself).
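
Roughly what that rule looks like applied by hand around a driver which passes octets straight through (the DSN, table and column names below are made up for illustration; a DBD that encodes/decodes itself makes the Encode calls unnecessary):

use DBI;
use Encode qw(encode decode);

my $dbh = DBI->connect('dbi:SomeDriver:db=test', 'user', 'pass',
                       { RaiseError => 1 });

my $name = "Andr\x{e9}";                    # a Perl character string
$dbh->do('INSERT INTO people (name) VALUES (?)',
         undef, encode('UTF-8', $name));    # encode on the way out

my ($octets) = $dbh->selectrow_array(
    'SELECT name FROM people WHERE id = 1');
my $back = decode('UTF-8', $octets);        # decode on the way in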

Background

I've brought this up a time or two in the past, but a number of things have 
happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the 
DBD::Pg encoding interface.

   http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

   https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is 
rather broken in a few ways. Notably, PostgreSQL sends *all* data back to 
clients in a single encoding -- even binary data (which is usually 
hex-encoded). So it made no sense to only decode certain columns. How to go 
about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

   http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
I bow to Tom's experience, but I'm still not sure how that applies to the DBI: so long as the interface between the database and Perl always encodes and decodes, the issues Tom describes are all Perl ones - no?
This made me realize that Unicode handling is much trickier than I ever 
realized. But it also emphasized for me how important it is to do everything 
one can to do Unicode right. Tom followed up with a *lot* more detail in three 
OSCON presentations this year, all of which you can read here:

   http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that 
page before you read the presentations in HTML).

And finally, I ran into an issue recently with Oracle, where we have an Oracle 
database that should have only UTF-8 data but some row values are actually in 
other encodings. This was a problem because I told DBD::Oracle that the 
encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I 
got broken data back from the database and then my app crashed when I tried to 
act on a string with the utf8 flag on but containing non-unicode bytes. I 
reported this issue in a DBD::Oracle bug report:

   https://rt.cpan.org/Public/Bug/Display.html?id=70819
Surely Oracle should return the data encoded as you asked for it, and if it does not then Oracle is broken. I'd still like to see this case so we can check whether Oracle really is at fault and whether there is a fix for it.

In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I believe the latter is much quicker as the data is not checked. Many people (myself included) are particularly interested in DBD::Oracle being fast, and if all the occurrences were changed to decode I'd patch that out in my copy, as I know the data I receive is UTF-8 encoded.
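
For reference, the difference at the Perl level is roughly this (Encode::_utf8_on standing in for SvUTF8_on and utf8::decode for sv_utf8_decode; the sample bytes are made up):

use Encode ();

# Bytes that are NOT valid UTF-8 (a lone 0xE9, Latin-1 e-acute).
my $fast    = "caf\xE9";
my $checked = "caf\xE9";

# SvUTF8_on equivalent: flip the flag without looking at the bytes.
# Cheap, but malformed data is now waiting to blow up later.
Encode::_utf8_on($fast);

# sv_utf8_decode equivalent: only set the flag if the bytes really are
# UTF-8. Slower, but bad data is caught here.
warn "not valid UTF-8 - left as bytes\n" unless utf8::decode($checked);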

But all this together leads me to believe that it's time to examine adding 
explicit Unicode support to the DBI. But it needs to be designed as carefully 
as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary 
flexibility. I think it should mostly stay out of users' way and have 
reasonable defaults. But it should be clear what each knob we offer does and 
how it affects things. Side-effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given 
priority. Perl has pretty specific ideas about what is and is not Unicode, so 
we should respect that as much as possible. If that means encoding and decoding 
rather than just flipping the utf8 bit, then fine.
See above. I'd like the chance to go with speed and take the consequences rather than go the slower route and know incorrect UTF-8 will be spotted.

* The performance impact must be kept as minimal as possible. So if we can get away with 
just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that, 
though, since Perl's internal representation, called "utf8", is not the same 
thing as UTF-8. But if there's an efficient way to convert between the two, then it 
should be adopted. For other encodings, obviously a full encode/decode path must be 
followed.
I thought UTF-8 when used in Perl used the strict definition and utf8 used Perl's looser definition - see http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8
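
A quick illustration of that distinction using Encode directly (the sample bytes are made up for the example):

use Encode qw(find_encoding decode FB_CROAK);

# "UTF-8" (hyphenated) resolves to the strict codec, plain "utf8" to Perl's
# lax, extended one (see the Encode docs linked above).
print find_encoding('UTF-8')->name, "\n";   # prints utf-8-strict
print find_encoding('utf8')->name,  "\n";   # prints utf8

# The strict codec refuses ill-formed input such as a UTF-16 surrogate,
# which is exactly the checking that costs time on a hot fetch path.
my $surrogate = "\xED\xA0\x80";             # would decode to U+D800
eval { decode('UTF-8', $surrogate, FB_CROAK) };
print "strict codec rejected it: $@" if $@;
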
* Drivers must be able to adopt the API in a straight-forward way. That is to 
say, we need to make sure that the interface covers what most (all?) drivers 
need. Some, like DBD::Pg, can specify that only one encoding come back from the 
database. Maybe others (DBD::mysql) can have individual columns in different 
encodings? It needs to cover that case, too.

* It must be able to give the drivers some flexibility. Where we can't account 
for everything that all drivers need forever, we should make it possible for 
them to add what they need without changing the overall API or the meaning of 
the interfaces provided by the DBI.

I'm not at all clear what such an API should look like. Based on my extensive 
experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and 
limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to 
have at least these knobs:

1. An attribute indicating the database encoding. This is the encoding one 
expects all data coming from the database to be in. When this is set, the DBI 
or the driver would decode incoming data to Perl's internal format and encode 
data sent to the database.
Don't DBDs do this now? In DBD::ODBC I know the encoding of the data I receive, decode it when I get it and encode it when I send it, and I believe that is what DBD::Oracle does as well. There is one exception in ODBC for drivers which don't truly abide by the ODBC spec and send 8-bit data back UTF-8 encoded (see later).
2. A fourth param to bind_param() to indicate the encoding in which to send 
column data to the database. Defaults to the database encoding.

3. A new parameter to prepare() to indicate the encodings of specific columns 
to be selected.

4. An ENCODING attribute on statement handles that indicates the encoding of 
each column.
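
As I say above, drivers already offer something like knob 1 in their own spellings, which is part of why I'm not sure the DBI itself needs to change. Roughly (the attribute names below are the drivers' real ones; the dbi_encoding line is purely illustrative and does not exist):

use DBI;

my ($user, $pass) = ('someuser', 'somepass');   # placeholders

# Existing per-driver switches for "decode what comes back from the database":
my $pg     = DBI->connect('dbi:Pg:dbname=test',        $user, $pass,
                          { pg_enable_utf8    => 1 });
my $mysql  = DBI->connect('dbi:mysql:database=test',   $user, $pass,
                          { mysql_enable_utf8 => 1 });
my $sqlite = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
                          { sqlite_unicode    => 1 });

# A DBI-level attribute as in point 1 might read something like this;
# the name is made up for illustration only:
# my $dbh = DBI->connect($dsn, $user, $pass, { dbi_encoding => 'UTF-8' });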

This is just a preliminary proposal, but covers most of the basics, I think. 
(I'm sure I'm suggesting the wrong places for some things). It does assume that 
one wants one's text data to always be decoded into Perl's internal form and 
encoded when sent to the database. There ought also to be a way to just 
continue to get raw binary data and encode and decode it oneself (e.g. for 
actual binary columns).
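
Getting raw binary back already works per driver; for example DBD::Pg lets you bind a parameter as bytea so no character handling is applied to it (a rough sketch with made-up table and connection details):

use DBI;
use DBD::Pg qw(:pg_types);

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

# Raw bytes stay raw: binding as PG_BYTEA stops both DBD::Pg and the server
# from treating the value as text in the client encoding.
my $sth = $dbh->prepare('INSERT INTO blobs (data) VALUES (?)');
$sth->bind_param(1, "\x00\xFF\xFE some raw bytes", { pg_type => PG_BYTEA });
$sth->execute;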

I know that Tim Bunce has thought about this some in the past, and Greg Sabino 
Mullane and I have discussed it quite a lot with Dave Rolsky and others. So I 
know that folks have some ideas about this stuff. So let's hear them. Let's put 
our minds together and try to come up with an interface that we can all work 
with.

Thanks,

David

I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX, and with patches from Alexander Foken for Windows); unicode is implemented in an awkward fashion in ODBC itself. I'd like to hear from DBD authors what support they already have and how it is implemented so we can see what ground is already covered and where the problems were, as I remain unconvinced a problem exists other than incorrectly coded DBDs. I'm happy to collate that information. As a start, I'll describe DBD::ODBC:

1. ODBC has 2 sets of APIs, SQLxxxA (each character is 8 bits) and SQLxxxW (each character is 16 bits, UCS-2). This is how Microsoft did it, and yes, I know that does not support all of unicode, but code pages get involved too.

2. You select which API you are using with a macro when you compile your application so you cannot change your mind. You can in theory call SQLxxxA or SQLxxxW functions directly but if you use SQLxxx you get the A or W depending on what the macro is set to.
Problem: DBD::ODBC has to be built one way or the other.

3. When using the SQLxxxA functions you can still bind columns/parameters as wide characters but the ODBC driver needs to support this.

4. When using SQLxxxW functions all strings are expected in UCS-2. You can bind columns and parameters as whatever type you like but obviously if you bind a unicode column as SQLCHAR instead of SQLWCHAR you probably get the wrong data back.

5. If you write a unicode ODBC application using the SQLxxxW functions and you connect to an ODBC driver which does not support them, the ODBC driver manager drops the high byte and passes the data to the driver's SQLxxxA function. Similarly, the other way around: if the app uses the SQLxxxA functions but the driver supports the SQLxxxW functions, the ODBC driver manager converts the 8-bit data to UCS-2 and calls the SQLxxxW functions in the driver. The driver manager does not get involved in conversions on bound column or parameter data. Problem: the non-Windows driver managers have implemented unicode support in this respect differently and I do not support them all when built for unicode. unixODBC does what Microsoft does - supported. http://search.cpan.org/~mjevans/DBD-ODBC-1.31/ODBC.pm#Unicode describes what unixODBC does and how you can avoid some conversions.
iODBC uses wchar_t types (not supported).
The DataDirect ODBC driver manager uses UTF-8 (which has other problems in ODBC - see http://www.martin-evans.me.uk/node/20#unicode) and is unsupported when DBD::ODBC is built for unicode, but because this driver manager uses UTF-8 you can use unicode when DBD::ODBC is not built for unicode.

6. Usernames, passwords, column names, table names, SQL, DSNs etc. can all be unicode. Problem: DBI did not use to support supplying a scalar to the connect method - it was a char *. I changed DBI ages ago (1.607) to add another connect method that passes the username, password and DSN as scalars, to support unicode in connect parameters.
Fortunately the prepare method already passed a scalar.
I don't think there are problems with the other DBI methods as I think they all use scalars and not char *.

7. When DBD::ODBC is built for unicode (the default on Windows and an option, but not the default, on other platforms) DBD::ODBC has to convert all that UCS-2 data to UTF-8 and vice versa. Problem: this is time consuming, so the code to do it is included in DBD::ODBC and written in C, using Windows API calls on Windows and code shipped with DBD::ODBC on other platforms.

8. DBD::ODBC does not support unicode parameter names. You cannot do:

select * from table where column = :unicode_param_name

9. You need at least Perl 5.8.1 as Perl's unicode support is fundamentally broken before then.

10. When using the wide APIs in ODBC some drivers will report different column types for columns. For instance, when listing tables, columns etc. some drivers (e.g., Microsoft SQL Server) will report the column types as wide types even if the strings actually fit in 7-bit ASCII.

11. Some ODBC drivers do not support the SQLxxxW APIs but can return data UTF-8 encoded (e.g., PostgreSQL). DBD::ODBC has no idea the data is UTF-8 encoded, so someone sent me a patch to add the odbc_utf8_on parameter, which tells DBD::ODBC all data is UTF-8 encoded (a usage sketch follows after point 12).

12. To signal to Perl that data is UTF-8, DBD::ODBC does:

#ifdef sv_utf8_decode
    sv_utf8_decode(sv);
#else
    SvUTF8_on(sv);
#endif


I believe the former does check the encoding and the latter does not, but not all Perls have the former.
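
For completeness, the point 11 case looks roughly like this from a script (the DSN and credentials are placeholders):

use DBI;

# An ANSI (non-unicode) DBD::ODBC build talking to a driver that returns
# UTF-8 encoded octets, e.g. the PostgreSQL ODBC driver:
my $dbh = DBI->connect('dbi:ODBC:DSN=pgdsn', 'user', 'secret',
                       { RaiseError => 1, odbc_utf8_on => 1 });

# Returned strings are then flagged as UTF-8 via the mechanism in point 12.
my ($note) = $dbh->selectrow_array('SELECT note FROM memos WHERE id = 1');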

I use DBD::Oracle and DBD::ODBC a lot and I know you can send and receive unicode data reliably because both those DBDs decode/encode data from/to the database. In fact, for Oracle we have masses of unicode data here in all sorts of languages including Chinese and it is working fine.

I guess I just don't see what is broken in the DBI as it stands, although I'm not ruling out that it might be able to do a little more to help a DBD.

Martin
