On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote:

> I'm not sure any change is required to DBI to support unicode. As far as I'm 
> aware unicode already works with DBI if the DBDs do the right thing.

Right, but the problem is that, IME, none of them do "the right thing." As I 
said, I've submitted encoding-related bug reports for every DBD I've used in 
production code. And they all have different interfaces for tweaking things.

> If you stick to the rule that all data Perl receives must be decoded and all 
> data Perl exports must be encoded it works (ignoring any issues in Perl 
> itself).

Er, was there supposed to be a ", then …" statement there?
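For the record, the rule itself is easy to state in code. Roughly (with made-up bytes standing in for what a driver might hand back raw):

```perl
use Encode qw(decode encode);

# Bytes as they might arrive from a driver that does no decoding.
# "née" in UTF-8 is the 4 bytes 0x6E 0xC3 0xA9 0x65 (3 characters).
my $bytes_from_db = "\x6E\xC3\xA9\x65";

# Rule: decode everything coming into Perl...
my $string = decode('UTF-8', $bytes_from_db);   # now a character string

# ...and encode everything going back out.
my $bytes_to_db = encode('UTF-8', $string);

print length($string),     "\n";   # 3 characters
print length($bytes_to_db), "\n";  # 4 bytes
```

If every DBD did exactly this at the boundary, the rest really would be "Perl issues."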

> I bow to Tom's experience but I'm still not sure how that applies to DBI so 
> long as the interface between the database and Perl always encodes and 
> decodes then the issues Tom describes are all Perl ones - no?

The trouble is that:

1. They don't always encode or decode.
2. When they do, they tend to get subtle bits wrong.
3. And they all have different interfaces and philosophies for doing so.

> Surely Oracle should return the data encoded as you asked for it and if it 
> did not Oracle is broken.
> I'd still like to see this case and then we can see if Oracle is broken and 
> if there is a fix for it.

Oh I don't doubt that Oracle is broken.

> In some places DBD::Oracle does sv_utf8_decode(scalar) or SvUTF8_on(scalar) 
> (depending on your Perl) and in some places it just does SvUTF8_on(scalar). I 
> believe the latter is much quicker as the data is not checked. Many people 
> (myself included) are particularly interested in DBD::Oracle being fast and 
> if all the occurrences were changed to decode I'd patch that out in my copy 
> as I know the data I receive is UTF-8 encoded.

IME, it needs an "assume Oracle is broken" knob. That is, I should have the 
option to enforce encoding and decoding, rather than just flipping SvUTF8. And I 
think that such an interface should be standardized in the DBI, along with 
detailed information for driver authors on how to "get it right."
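The difference between the two XS calls has a pure-Perl analogue that shows what's at stake (a sketch, not DBD::Oracle's actual code path):

```perl
use Encode ();

my $bad = "\xFF\xFE";   # 0xFF can never appear in valid UTF-8

# utf8::decode() is the Perl-space analogue of sv_utf8_decode():
# it validates the bytes, and refuses to flip the flag on failure.
my $checked = $bad;
my $ok = utf8::decode($checked);
print $ok ? "valid\n" : "invalid\n";            # invalid

# Encode::_utf8_on() is the analogue of SvUTF8_on(): it flips the
# flag blindly, so bad bytes become a corrupt "string" that can
# blow up far away from where it was created.
my $blind = $bad;
Encode::_utf8_on($blind);
print utf8::is_utf8($blind) ? "flag on\n" : "flag off\n";   # flag on
```

The blind version is faster precisely because it skips the scan, which is why a knob (rather than a hard-coded choice either way) seems right.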

> See above. I'd like the chance to go with speed and take the consequences 
> rather than go with slower but know incorrect UTF-8 is spotted.

And maybe that's the default. But I should be able to tell it to be pedantic 
when the data is known to be bad (see, for example, data from an 
SQL_ASCII-encoded PostgreSQL database).

> I thought UTF-8 when used in Perl used the strict definition and utf-8 used 
> Perl's looser definition - see 
> http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8

That's right. So if I want to ensure that I'm getting strict encoding in my 
database, it needs to encode and decode, not simply flip SvUTF8.
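The strict/lax distinction is observable directly in Encode. For example, a UTF-16 surrogate code point is ill-formed in strict UTF-8 but sails through Perl's lax "utf8":

```perl
use Encode qw(decode);

# U+D800 (a lone surrogate) encoded as 3 bytes.
my $bytes = "\xED\xA0\x80";

# Lax: Perl's internal "utf8" lets the surrogate through.
my $lax = decode('utf8', $bytes);

# Strict: "UTF-8" rejects the ill-formed sequence and, with the
# default check mode, substitutes U+FFFD instead.
my $strict = decode('UTF-8', $bytes);

print ord($lax)    == 0xD800 ? "surrogate passed\n" : "caught\n";
print ord($strict) == 0xFFFD ? "replaced\n"         : "passed\n";
```

So "just flip SvUTF8" is even laxer than the lax case: nothing is checked at all.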

> Don't DBDs do this now? I know the encoding of the data I receive in 
> DBD::ODBC and decode it when I get it and encode it when I send it and I 
> believe that is what DBD::Oracle does as well. There is one exception in ODBC 
> for drivers which don't truly abide by ODBC spec and send 8 bit data back 
> UTF-8 encoded (see later).

There is no single API for configuring this in the DBI, and I argue there 
should be.

> I've spent a lot of effort getting unicode working in DBD::ODBC (for UNIX and 
> with patches from Alexander Foken for Windows) which is implemented in an 
> awkward fashion in ODBC. I'd like to hear from DBD authors what support they 
> already have and how it is implemented so we can see what ground is already 
> covered and where the problems were.

DBD::Pg's approach is currently broken. Greg is working on fixing it, but for 
compatibility reasons the fix is non-trivial (and the API might be, too). In a 
perfect world DBD::Pg would just always do the right thing, as the database 
tells it what encoding to use when you connect (and *all* data is encoded as 
such, not just certain data types). But the world is not perfect, and there's a 
lot of legacy stuff.

Greg, care to add any other details?

> as I remain unconvinced a problem exists other than incorrectly coded DBDs. 
> I'm happy to collate that information. As a start I'll describe the DBD::ODBC:
> 
> 1. ODBC has 2 sets of APIs, SQLxxxA (each chr is 8 bits) and SQLxxxW (each 
> chr is 16 bits and UCS-2). This is how Microsoft did it and yes I know that 
> does not support all of unicode but code pages get involved too.
> 
> 2. You select which API you are using with a macro when you compile your 
> application so you cannot change your mind.
> You can in theory call SQLxxxA or SQLxxxW functions directly but if you use 
> SQLxxx you get the A or W depending on what the macro is set to.
> Problem: DBD::ODBC has to built one way or the other.
> 
> 3. When using the SQLxxxA functions you can still bind columns/parameters as 
> wide characters but the ODBC driver needs to support this.
> 
> 4. When using SQLxxxW functions all strings are expected in UCS-2. You can 
> bind columns and parameters as whatever type you like but obviously if you 
> bind a unicode column as SQLCHAR instead of SQLWCHAR you probably get the 
> wrong data back.
> 
> 5. If you write a unicode ODBC application using SQLxxxW functions and you 
> connect to an ODBC driver which does not support SQLxxxW functions the ODBC 
> driver manager drops the high byte and passes the data to the drivers SQLxxxA 
> function and similarly the other way around if the app uses SQLxxxA fns but 
> the driver supports SQLxxxW fns the ODBC driver manager converts the 8 bit 
> data to UCS-2 and calls the SQLxxxW fns in the driver. The driver manager 
> does not get involved in conversions on bound column or parameter data.

Wow. I'd want an exception in such a situation, personally.

> Problem: the non-windows driver managers have implemented unicode support in 
> this respect differently and I do not support them all when built for unicode.
> unixODBC does what Microsoft does - supported. 
> http://search.cpan.org/~mjevans/DBD-ODBC-1.31/ODBC.pm#Unicode describes what 
> unixODBC does and how you can avoid some conversions.
> iODBC uses wchar_t types (not supported).
> DataDirect ODBC driver manager uses UTF-8 (which has other problems in ODBC - 
> see http://www.martin-evans.me.uk/node/20#unicode) and is unsupported when 
> built for unicode but because this driver uses UTF-8 you can use unicode when 
> DBD::ODBC is not built for unicode.
> 
> 6. usernames, passwords, column names, table names, SQL, DSNs etc can all be 
> unicode.
> Problem: DBI did not used to support supplying a scalar to the connect method 
> - it was a char *. I changed DBI ages ago (1.607) to add another connect 
> method to pass scalar username, password, DSN string to support unicode in 
> connect parameters.
> Fortunately the prepare method already passed a scalar.
> I don't think there are problems with the other DBI methods as I think they 
> all use scalars and not char *.

Excellent. Musta been a bitch to track that down.

> 7. When DBD::ODBC is built for unicode (the default on Windows and an option 
> on other platforms but not the default) DBD::ODBC has to convert all that 
> UCS-2 data to UTF-8 and vice versa.
> Problem: this is time consuming so the code to do this is included in 
> DBD::ODBC and written in C using either Windows API calls for Windows and 
> code included with DBD::ODBC for other platforms.

Yeah, that's reasonable.
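For anyone following along, the conversion in question is the same one Encode can do in Perl space (a sketch of the idea only; DBD::ODBC does it in C for speed):

```perl
use Encode qw(decode encode);

# "€" (U+20AC) in little-endian UCS-2, as a wide ODBC API might
# hand it back when DBD::ODBC is built for unicode.
my $ucs2 = "\xAC\x20";

# Inbound: UCS-2 bytes in, Perl character string out...
my $string = decode('UCS-2LE', $ucs2);

# ...and outbound again for bound parameters.
my $roundtrip = encode('UCS-2LE', $string);

printf "U+%04X\n", ord $string;                  # U+20AC
print $roundtrip eq $ucs2 ? "ok\n" : "not ok\n"; # ok
```

Doing this for every string on every fetch is exactly where the time goes, hence the C implementation.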

> 8. DBD::ODBC does not support unicode parameter names. You cannot do:
> 
> select * from table where column = :unicode_param_name
> 
> 9. you need at least Perl 5.8.1 as it is fundamentally broken before then.
> 
> 10. When using the wide APIs in ODBC some drivers will report different 
> column types for columns. For instances when listing tables, columns etc some 
> drivers (e.g., Microsoft SQL Server) will report the column types as wide 
> types even if the strings actually fit in 7-bit ASCII.
> 
> 11. Some ODBC drivers do not support the SQLxxxW APIs but can return data 
> UTF-8 encoded (e.g., PostgreSQL). DBD::ODBC has no idea the data is UTF-8 
> encoded so someone sent me a patch to add the odbc_utf8_on parameter which 
> tells DBD::ODBC all data is UTF-8 encoded.

Yeah, that's the kind of interface I'm talking about. DBD::Pg calls a similar 
(but flawed) parameter pg_enable_utf8. DBD::SQLite calls it sqlite_unicode.
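To make the inconsistency concrete, here's roughly what the same knob looks like across three drivers today (DSNs and credentials are placeholders, and the semantics behind each flag differ too):

```perl
use DBI;

my ($user, $pass) = ('user', 'secret');   # placeholders

# Three drivers, three spellings of "treat the data as UTF-8":
my $odbc = DBI->connect('dbi:ODBC:mydsn',        $user, $pass,
                        { odbc_utf8_on   => 1 });
my $pg   = DBI->connect('dbi:Pg:dbname=mydb',    $user, $pass,
                        { pg_enable_utf8 => 1 });
my $lite = DBI->connect('dbi:SQLite:dbname=my.db', '', '',
                        { sqlite_unicode => 1 });
```

A single DBI-blessed attribute name, with documented semantics, would let application code stop caring which driver it happens to be talking to.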

> 12. To signal to Perl data is UTF-8 DBD::ODBC does:
> 
> #ifdef sv_utf8_decode
>                    sv_utf8_decode(sv);
> #else
>                    SvUTF8_on(sv);
> #endif
> 
> 
> I believe the former does check the encoding, the latter does not but not all 
> Perls have the former.

Makes sense. It would be nice to have the option to turn off the validating 
sv_utf8_decode() call for performance reasons (see note from Lyle in this 
thread).

> I use DBD::Oracle and DBD::ODBC a lot and I know you can send and receive 
> unicode data reliably because both those DBDs decode/encode data from/to the 
> database. In fact, for Oracle we have masses of unicode data here in all 
> sorts of languages including Chinese and it is working fine.

Yeah, it works fine, I think we have some decent 80-90% solutions. It's the 
lack of consistency in the interface and rules between DBDs that's unfortunate.

> I guess I just don't see what is broken in DBI as it stands although I'm not 
> ruling out it might be able to do a little more to help a DBD.

It's not broken. The problem is that every driver does it differently and none 
of them does it quite right. There need to be some standards on this, IMHO. 
That's why I was suggesting interface additions to the DBI in my OP. If you 
don't accept that such things should be standardized, that's a discussion we 
can have. The rest is implementation details.

Best,

David