Re: DBD::ODBC tricky unicode problem

Martin J. Evans Sun, 27 Oct 2013 10:47:56 -0700

On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.

On Sun, Oct 27, 2013 at 12:18:53PM +0000, Martin J. Evans wrote:


Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
characters 7 bytes
   ords of output string:e2,201a,ac,
# what happened here is SQLDecribeParam described the parameter as
SQL_CHAR and that is what D:O bound it as. The driver converted 82
in code page 1252 to 201a (a magic quote) and the database sees this
as 3 chrs
# the probably mistake here is that D:O should have looked at the
perl data, seen it was unicode and not bound it as SQL_CHAR (but
SQL_WCHAR) even though SQLDescribeParam said it was


Agreed.

Good. That is also easily done assuming I check params are rebound incase someone tries to insert code page chrs followed by unicode.

Inserting a UTF-8 encoded unicode euro, utf8 flag off:
"\x{0082}" does not map to cp1252 at test_code.pl line 36.


I'd caution against using phases like "UTF-8 encoded unicode euro, utf8
flag off". From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).

Yeah, I realised that, I only threw that one in because I've come acrosspeople doing that i.e., encoding unicode and inserting the encodedoctets, and I wanted to cover all bases.

To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.


see above.

Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
...
# the above is arguably what should have happened in the first test
case i.e., D:O should have bound as SQL_WVARCHAR because the perl
data was unicode. I included this one to show you what happens if
you do it right.


Agreed.

So, I'm thinking the problem above is D:O ignores utf8 flag on
parameters when they are bound and uses whatever SQLDescribeParam
says instead (SQL_CHAR in case of varchar columns). If it didn't
ignore the utf8 flag here, it would have to rebind on every execute
(which it may already do, I didn't check).


Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.

Now we've got to this point we /can/ consider what happens when you readthe data back. By default, in a unicode build of DBD::ODBC ALL stringdata is bound as SQL_WCHARs and I'm of the opinion that this is rightand should continue (apart from anything else the only alternative is tobind as SQL_CHAR when the column is char/varchar and this doesn't helpat all as the server will have to map server codepage chrs to clientcodepage chrs and this cannot always work). So let's consider whathappens for each example:


Inserting a unicode euro, utf8 flag on:
  input string: €

data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3

data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7bytes

  ords of output string:e2,201a,ac,

So, here you didn't get back what you put in and the database didn't seethe inserted data as a single euro chr but as 3 individual chrs in theclient codepage which was mapped to the column codepage. When read backit mapped from the column code page to unicode as the select bound chrsas SQL_WCHAR. I'm struggling to see the value to anyone of thisbehaviour but I'd love to hear from you if you don't agree.


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
  In other words a stream of octets
"\x{0082}" does not map to cp1252 at test_code.pl line 36.
  input string: â\x{0082}¬

data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3

data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7bytes

  ords of output string:e2,201a,ac,

This is just the same as the first example and as such I cannot see theusefulness of it. However, because I know this I am at a loss as to whyno one has told me this before I discovered it for myself. I've spokento a few people actively using DBD::ODBC on Windows and mostly they arenot using unicode for inserts/updates. If they restrict their inserts toASCII there will be no change for them as what they get back now isalready unicode (just the first 127 chrs of unicode) because DBD::ODBCalways binds as SQL_WCHAR. If you insert chrs in the upper part of yourcodepage then what you get back is not what you inserted i.e., it is notchrs in your codepage, it is unicode and the database will have providedthe mapping.

As it turns out I was about to release a 1.45 official release at theend of the 1.44 development cycle. I plan to release this any day nowbefore considering any of this. My suggestion is that at the start ofthe 1.46_xx dev cycle:

o I apply the switch as described, advertise it widely, and houndeveryone I know to try it out. As I see the current behaviour as brokenand badly I just cannot see right now how anyone could have used it asit is and be adversely affected when it is fixed.

o add test cases for round tripping of unicode data to varchars (ofwhich there is none right now as it is all to nvarchar which alreadyworks as discussed).


o run through the same tests for nvarchar/nchar just to be sure.

If, in the mean time someone comes up with a situation I've not thoughtof we could be back to the issue of a deprecation cycle.

I'd be in no rush to release a non-dev release of D::O and even if somebad bug came up I can do an new official release with only the fix forthat in.


So, the only remaining issues are:

1) my belief that binding output data as unicode/wide_chars always inthe unicode build is the right one. If you see any problem in that itcould turn things up-side-down but it would be useful to know.

2) if someone specifies a bind type on bind_param I follow that nomatter what?

3) SQLDescribeParam is not always supported (e.g. freeTDS) and even ifit is, it can fail (because rearranging the SQL into a select which mostdrivers do for SQLDescribeParam can fail). In this case D:O has afallback binding which you can override. I suggest that unless it isoverridden D:O looks at the input param and binds it as unicode if it isunicode (SvUTF8), otherwise it binds it as SQL_CHAR. In other words, theresult from SQLDescribeParam call becomes irrelevant for char data as welook at the parameter data first.

4) check that D:O rebinds parameters per execute in case the parameterhas changed - I'm fairly sure it already does this. I will build thisinto the test cases.

If I follow this the changes in D:O are minimal and I've already madethen here without any issues so far.


Opinions?

Martin
--
Martin J. Evans
Wetherby, UK

Re: DBD::ODBC tricky unicode problem

Reply via email to