David, I forgot to answer your post first and ended up putting most of my 
comments in a reply to Greg's posting - sorry, it was a long night last night. 
Some further comments below:

On 21/09/11 19:44, David E. Wheeler wrote:
On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote:

I'm not sure any change is required to DBI to support unicode. As
far as I'm aware unicode already works with DBI if the DBDs do the
right thing.

Right, but the problem is that, IME, none of them do "the right
thing." As I said, I've submitted encoding-related bug reports for
every DBD I've used in production code. And they all have different
interfaces for tweaking things.

So it is these "different interfaces for tweaking things" you'd like to 
standardise, I presume.

If you stick to the rule that all data Perl receives must be
decoded and all data Perl exports must be encoded it works
(ignoring any issues in Perl itself).

Er, was there supposed to be a ", then …" statement there?

I bow to Tom's experience but I'm still not sure how that applies
to DBI so long as the interface between the database and Perl
always encodes and decodes then the issues Tom describes are all
Perl ones - no?
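
For concreteness, here is that rule in miniature with the DBD taken out of 
the picture (a sketch only - the octets here are invented):

  use strict;
  use warnings;
  use Encode qw(decode encode);

  # Pretend these octets arrived from the database, UTF-8 encoded:
  my $octets = "caf\xc3\xa9";

  # Decode on the way in - $chars is now a Perl character string:
  my $chars = decode('UTF-8', $octets);

  # ... the application works with $chars ...

  # Encode on the way out, before handing it back to the database:
  my $bytes_out = encode('UTF-8', $chars);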

The trouble is that:

1. They don't always encode or decode
2. When they do, they tend to get subtle bits wrong
3. And they all have different interfaces and philosophies for doing so

Surely Oracle should return the data encoded as you asked for it
and if it did not Oracle is broken. I'd still like to see this case
and then we can see if Oracle is broken and if there is a fix for
it.

Oh I don't doubt that Oracle is broken.

In some places DBD::Oracle does sv_utf8_decode(scalar) or
SvUTF8_on(scalar) (depending on your Perl), and in some places it
just does SvUTF8_on(scalar). I believe the latter is much quicker,
as the data is not checked. Many people (myself included) are
particularly interested in DBD::Oracle being fast, and if all the
occurrences were changed to decode I'd patch that out in my copy,
as I know the data I receive is UTF-8 encoded.
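
For anyone following along, the Perl-level analogue of those two XS calls 
(a sketch; the octets are invented):

  use strict;
  use warnings;
  use Encode ();

  my $octets = "\xc3\xa9";    # valid UTF-8 for e-acute

  # The checked route: utf8::decode() validates the octets and only
  # then sets the UTF-8 flag; it returns false on malformed data.
  my $checked = $octets;
  utf8::decode($checked) or die "not valid UTF-8";

  # The fast route: Encode::_utf8_on() just flips the flag without
  # looking at the bytes - cheap, but garbage in, garbage out.
  my $unchecked = $octets;
  Encode::_utf8_on($unchecked);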

IME it needs an "assume Oracle is broken" knob. That is, I should
have the option to enforce encoding and decoding, rather than just
flipping SvUTF8. And I think that such an interface should be
standardized in the DBI, along with detailed information for driver
authors on how to "get it right."

ok, I get that.

See above. I'd like the chance to go with speed and take the
consequences, rather than go slower but know incorrect UTF-8 will be
spotted.

And maybe that's the default. But I should be able to tell it to be
pedantic when the data is known to be bad (see, for example, data from
an SQL_ASCII-encoded PostgreSQL database).

I thought UTF-8 when used in Perl used the strict definition and
utf-8 used Perl's looser definition - see
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8


That's right. So if I want to ensure that I'm getting strict encoding
in my database, it needs to encode and decode, not simply flip
SvUTF8.
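
A quick demonstration of the difference (my example - a UTF-16 surrogate 
is legal under the lax "utf8" encoding but rejected by the strict "UTF-8" 
one):

  use strict;
  use warnings;
  no warnings 'utf8';   # keep the lax decode below quiet
  use Encode qw(decode);

  # "\xed\xa0\x80" encodes the UTF-16 surrogate U+D800:
  my $octets = "\xed\xa0\x80";

  # The lax "utf8" encoding accepts it:
  my $lax = decode('utf8', $octets);

  # The strict "UTF-8" encoding croaks on it:
  my $copy   = $octets;
  my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
  print defined $strict ? "accepted\n" : "rejected: $@";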

Don't DBDs do this now? I know the encoding of the data I receive
in DBD::ODBC, and I decode it when I get it and encode it when I
send it; I believe that is what DBD::Oracle does as well. There is
one exception in ODBC, for drivers which don't truly abide by the
ODBC spec and send 8-bit data back UTF-8 encoded (see later).

There is no single API for configuring this in the DBI, and I argue
there should be.

I've spent a lot of effort getting unicode working in DBD::ODBC
(for UNIX, and with patches from Alexander Foken for Windows), as
unicode is implemented in an awkward fashion in ODBC. I'd like to
hear from DBD authors what support they already have and how it is
implemented, so we can see what ground is already covered and where
the problems were.

DBD::Pg's approach is currently broken. Greg is working on fixing it,
but for compatibility reasons the fix is non-trivial (and the API
might be, too). In a perfect world DBD::Pg would just always do the
right thing, as the database tells it what encodings to use when you
connect (and *all* data is encoded as such, not just certain data
types). But the world is not perfect; there's a lot of legacy stuff.

Greg, care to add any other details?

As I remain unconvinced a problem exists other than in incorrectly
coded DBDs, I'm happy to collate that information. As a start, I'll
describe DBD::ODBC:

1. ODBC has 2 sets of APIs: SQLxxxA (each character is 8 bits) and
SQLxxxW (each character is 16 bits, UCS-2). This is how Microsoft
did it, and yes, I know that does not support all of unicode, but
code pages get involved too.

2. You select which API you are using with a macro when you compile
your application, so you cannot change your mind. You can in theory
call the SQLxxxA or SQLxxxW functions directly, but if you use
SQLxxx you get the A or W variant depending on what the macro is set
to. Problem: DBD::ODBC has to be built one way or the other.

3. When using the SQLxxxA functions you can still bind
columns/parameters as wide characters but the ODBC driver needs to
support this.

4. When using SQLxxxW functions all strings are expected in UCS-2.
You can bind columns and parameters as whatever type you like, but
obviously if you bind a unicode column as SQLCHAR instead of
SQLWCHAR you will probably get the wrong data back.

5. If you write a unicode ODBC application using the SQLxxxW
functions and you connect to an ODBC driver which does not support
them, the ODBC driver manager drops the high byte and passes the
data to the driver's SQLxxxA function. Similarly, the other way
around: if the app uses the SQLxxxA functions but the driver
supports SQLxxxW, the driver manager converts the 8-bit data to
UCS-2 and calls the SQLxxxW functions in the driver. The driver
manager does not get involved in conversions on bound column or
parameter data.

Wow. I'd want an exception in such a situation, personally.

In reality, no one with any sense runs a non-unicode-enabled application 
against a unicode database, or one that has unicode in it, so I doubt it 
matters.

Problem: the non-Windows driver managers have implemented unicode
support in this respect differently, and I do not support them all
when DBD::ODBC is built for unicode. unixODBC does what Microsoft
does - supported;
http://search.cpan.org/~mjevans/DBD-ODBC-1.31/ODBC.pm#Unicode
describes what unixODBC does and how you can avoid some
conversions. iODBC uses wchar_t types - not supported. The
DataDirect ODBC driver manager uses UTF-8 (which has other problems
in ODBC - see http://www.martin-evans.me.uk/node/20#unicode) and is
unsupported when DBD::ODBC is built for unicode, but because that
driver manager uses UTF-8 you can use unicode when DBD::ODBC is not
built for unicode.

6. Usernames, passwords, column names, table names, SQL, DSNs etc.
can all be unicode. Problem: DBI did not use to support supplying
a scalar to the connect method - it was a char *. I changed DBI
ages ago (1.607) to add another connect method that passes scalar
username, password and DSN strings, to support unicode in connect
parameters. Fortunately the prepare method already passed a
scalar. I don't think there are problems with the other DBI
methods, as I think they all use scalars and not char *.

Excellent. Musta been a bitch to track that down.
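
So, assuming a driver built for unicode, something like this should now 
work (the DSN and credentials here are made up):

  use strict;
  use warnings;
  use utf8;   # the string literals below are UTF-8 in the source
  use DBI;

  # Since DBI 1.607 the DSN, username and password reach the driver
  # as Perl scalars, so they can carry unicode:
  my $dbh = DBI->connect('dbi:ODBC:DSN=MyDSN', 'müller', 'пароль',
                         { RaiseError => 1 });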

7. When DBD::ODBC is built for unicode (the default on Windows, and
an option - but not the default - on other platforms), DBD::ODBC has
to convert all that UCS-2 data to UTF-8 and vice versa. Problem:
this is time consuming, so the code to do it is included in
DBD::ODBC and written in C, using Windows API calls on Windows and
code shipped with DBD::ODBC on other platforms.

Yeah, that's reasonable.
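
At the Perl level the conversion item 7 describes looks like this with 
Encode (a sketch only - the real code is C, for speed):

  use strict;
  use warnings;
  use Encode qw(encode decode);

  my $chars = "\x{20ac}";                  # EURO SIGN, a Perl string
  my $ucs2  = encode('UCS-2LE', $chars);   # what the SQLxxxW APIs want
  my $back  = decode('UCS-2LE', $ucs2);    # and back again on fetch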

8. DBD::ODBC does not support unicode parameter names. You cannot
do:

select * from table where column = :unicode_param_name

9. You need at least Perl 5.8.1, as unicode support in Perl is
fundamentally broken before then.

10. When using the wide APIs in ODBC, some drivers will report
different column types for columns. For instance, when listing
tables, columns etc., some drivers (e.g., Microsoft SQL Server) will
report the column types as wide types even if the strings actually
fit in 7-bit ASCII.

11. Some ODBC drivers do not support the SQLxxxW APIs but can
return data UTF-8 encoded (e.g., PostgreSQL). DBD::ODBC has no idea
the data is UTF-8 encoded, so someone sent me a patch to add the
odbc_utf8_on parameter, which tells DBD::ODBC all data is UTF-8
encoded.

Yeah, that's the kind of interface I'm talking about. DBD::Pg calls a
similar (but flawed) parameter pg_enable_utf8. DBD::SQLite calls it
sqlite_unicode.

right.
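
Side by side, the same "trust me, it's UTF-8" knob spelt three different 
ways (connection details made up), which rather illustrates David's point:

  use strict;
  use warnings;
  use DBI;

  my ($user, $pass) = ('user', 'secret');

  my $odbc   = DBI->connect('dbi:ODBC:DSN=pg', $user, $pass,
                            { odbc_utf8_on   => 1 });
  my $pg     = DBI->connect('dbi:Pg:dbname=db', $user, $pass,
                            { pg_enable_utf8 => 1 });
  my $sqlite = DBI->connect('dbi:SQLite:dbname=db.sqlite', '', '',
                            { sqlite_unicode => 1 });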

12. To signal to Perl that data is UTF-8, DBD::ODBC does:

  #ifdef sv_utf8_decode
      sv_utf8_decode(sv);
  #else
      SvUTF8_on(sv);
  #endif

I believe the former does check the encoding and the latter does
not, but not all Perls have the former.

Makes sense. Would be nice to have the option to turn off the first
item for performance reasons (see note from Lyle in this thread).

I use DBD::Oracle and DBD::ODBC a lot and I know you can send and
receive unicode data reliably because both those DBDs decode/encode
data from/to the database. In fact, for Oracle we have masses of
unicode data here in all sorts of languages including Chinese and
it is working fine.

Yeah, it works fine; I think we have some decent 80-90% solutions.
It's the lack of consistency in the interface and rules between DBDs
that's unfortunate.

I guess I just don't see what is broken in DBI as it stands,
although I'm not ruling out that it might be able to do a little
more to help a DBD.

It's not broken. The problem is that every driver does it differently
and none of them does it quite right. There need to be some standards
on this, IMHO. That's why I was suggesting interface additions to the
DBI in my OP. If you don't accept that such things should be
standardized, that's a discussion we can have. The rest is
implementation details.

Best,

David



So we need to define what the options are? Perhaps some of the ones I put 
in my answer to Greg's email.
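
To kick that off, one purely hypothetical spelling - the attribute name 
and values below are invented for discussion and are not part of DBI:

  use strict;
  use warnings;
  use DBI;

  my ($dsn, $user, $pass) = ('dbi:ODBC:DSN=MyDSN', 'user', 'secret');

  # "dbi_unicode" is a made-up, DBI-wide attribute name:
  my $dbh = DBI->connect($dsn, $user, $pass, {
      dbi_unicode => 'strict',  # decode/encode and validate
      # dbi_unicode => 'fast',  # trust the driver, just set SvUTF8
      # dbi_unicode => 'off',   # raw octets, pre-5.8 behaviour
  });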

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com
