On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.

On Sun, Oct 27, 2013 at 12:18:53PM +0000, Martin J. Evans wrote:

Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
characters 7 bytes
   ords of output string:e2,201a,ac,
# what happened here is SQLDecribeParam described the parameter as
SQL_CHAR and that is what D:O bound it as. The driver converted 82
in code page 1252 to 201a (a magic quote) and the database sees this
as 3 chrs
# the probably mistake here is that D:O should have looked at the
perl data, seen it was unicode and not bound it as SQL_CHAR (but
SQL_WCHAR) even though SQLDescribeParam said it was

Agreed.

Good. That is also easily done assuming I check params are rebound in case someone tries to insert code page chrs followed by unicode.

Inserting a UTF-8 encoded unicode euro, utf8 flag off:
"\x{0082}" does not map to cp1252 at test_code.pl line 36.

I'd caution against using phases like "UTF-8 encoded unicode euro, utf8
flag off". From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).

Yeah, I realised that, I only threw that one in because I've come across people doing that i.e., encoding unicode and inserting the encoded octets, and I wanted to cover all bases.

To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.

see above.

Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
...
# the above is arguably what should have happened in the first test
case i.e., D:O should have bound as SQL_WVARCHAR because the perl
data was unicode. I included this one to show you what happens if
you do it right.

Agreed.

So, I'm thinking the problem above is D:O ignores utf8 flag on
parameters when they are bound and uses whatever SQLDescribeParam
says instead (SQL_CHAR in case of varchar columns). If it didn't
ignore the utf8 flag here, it would have to rebind on every execute
(which it may already do, I didn't check).

Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.


Now we've got to this point we /can/ consider what happens when you read the data back. By default, in a unicode build of DBD::ODBC ALL string data is bound as SQL_WCHARs and I'm of the opinion that this is right and should continue (apart from anything else the only alternative is to bind as SQL_CHAR when the column is char/varchar and this doesn't help at all as the server will have to map server codepage chrs to client codepage chrs and this cannot always work). So let's consider what happens for each example:

Inserting a unicode euro, utf8 flag on:
  input string: €
data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string:e2,201a,ac,

So, here you didn't get back what you put in and the database didn't see the inserted data as a single euro chr but as 3 individual chrs in the client codepage which was mapped to the column codepage. When read back it mapped from the column code page to unicode as the select bound chrs as SQL_WCHAR. I'm struggling to see the value to anyone of this behaviour but I'd love to hear from you if you don't agree.

Inserting a UTF-8 encoded unicode euro, utf8 flag off:
  In other words a stream of octets
"\x{0082}" does not map to cp1252 at test_code.pl line 36.
  input string: â\x{0082}¬
data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 bytes
  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string:e2,201a,ac,

This is just the same as the first example and as such I cannot see the usefulness of it. However, because I know this I am at a loss as to why no one has told me this before I discovered it for myself. I've spoken to a few people actively using DBD::ODBC on Windows and mostly they are not using unicode for inserts/updates. If they restrict their inserts to ASCII there will be no change for them as what they get back now is already unicode (just the first 127 chrs of unicode) because DBD::ODBC always binds as SQL_WCHAR. If you insert chrs in the upper part of your codepage then what you get back is not what you inserted i.e., it is not chrs in your codepage, it is unicode and the database will have provided the mapping.

As it turns out I was about to release a 1.45 official release at the end of the 1.44 development cycle. I plan to release this any day now before considering any of this. My suggestion is that at the start of the 1.46_xx dev cycle:

o I apply the switch as described, advertise it widely, and hound everyone I know to try it out. As I see the current behaviour as broken and badly I just cannot see right now how anyone could have used it as it is and be adversely affected when it is fixed.

o add test cases for round tripping of unicode data to varchars (of which there is none right now as it is all to nvarchar which already works as discussed).

o run through the same tests for nvarchar/nchar just to be sure.

If, in the mean time someone comes up with a situation I've not thought of we could be back to the issue of a deprecation cycle.

I'd be in no rush to release a non-dev release of D::O and even if some bad bug came up I can do an new official release with only the fix for that in.

So, the only remaining issues are:

1) my belief that binding output data as unicode/wide_chars always in the unicode build is the right one. If you see any problem in that it could turn things up-side-down but it would be useful to know.

2) if someone specifies a bind type on bind_param I follow that no matter what?

3) SQLDescribeParam is not always supported (e.g. freeTDS) and even if it is, it can fail (because rearranging the SQL into a select which most drivers do for SQLDescribeParam can fail). In this case D:O has a fallback binding which you can override. I suggest that unless it is overridden D:O looks at the input param and binds it as unicode if it is unicode (SvUTF8), otherwise it binds it as SQL_CHAR. In other words, the result from SQLDescribeParam call becomes irrelevant for char data as we look at the parameter data first.

4) check that D:O rebinds parameters per execute in case the parameter has changed - I'm fairly sure it already does this. I will build this into the test cases.

If I follow this the changes in D:O are minimal and I've already made then here without any issues so far.

Opinions?

Martin
--
Martin J. Evans
Wetherby, UK

Reply via email to