On Sun, Oct 27, 2013 at 05:45:50PM +0000, Martin J. Evans wrote:
> On 27/10/2013 14:49, Tim Bunce wrote:
> 
> >So, the next question is what are the implications of fixing it for
> >existing applications? Do you need a deprecation cycle with warnings etc?

Might be worth nailing that down before moving on to output issues.

> Now we've got to this point we /can/ consider what happens when you
> read the data back. By default, in a unicode build of DBD::ODBC ALL
> string data is bound as SQL_WCHARs and I'm of the opinion that this
> is right and should continue (apart from anything else the only
> alternative is to bind as SQL_CHAR when the column is char/varchar
> and this doesn't help at all as the server will have to map server
> codepage chrs to client codepage chrs and this cannot always work).
> So let's consider what happens for each example:
> 
> Inserting a unicode euro, utf8 flag on:

I wouldn't involve input issues when discussing output issues.
Input and output are separate issues and are best kept separate. Otherwise
there are too many variables and distractions. I'd suggest
constructing the string on the server using suitable SQL statements
and checking that that string is received by the app.
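
Something along these lines, say (a rough sketch, assuming an MS SQL
Server backend, a connected $dbh from a unicode build of DBD::ODBC,
and a hypothetical scratch table "t" with a varchar column "c" - the
value is built entirely on the server with NCHAR() so the Perl-to-server
input path never gets involved):

    # build the test value server-side, then fetch it through DBD::ODBC
    $dbh->do(q{INSERT INTO t (c) VALUES (NCHAR(8364))});   # 8364 = U+20AC, euro
    my ($out) = $dbh->selectrow_array(q{SELECT c FROM t});
    print DBI::data_string_desc($out), "\n";
    printf "ords of output string: %vx\n", $out;           # hope for 20ac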

>   input string: €
>   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
>   ords of input string: 20ac,
>   bytes of input string: e2,82,ac,
>   database length: 3

"database length" is ambiguous, best to clarify as "database character length"
(or "database octet length").

>   data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
>   ords of output string: e2,201a,ac,
> 
> So, here you didn't get back what you put in and the database didn't
> see the inserted data as a single euro chr but as 3 individual chrs
> in the client codepage which was mapped to the column codepage. When
> read back it mapped from the column code page to unicode as the
> select bound chrs as SQL_WCHAR. I'm struggling to see the value to
> anyone of this behaviour but I'd love to hear from you if you don't
> agree.

Forget the input, focus on what the server had. The key question for
output is: did the application get the same characters as the server?
In this case the server had 3 chars and when they reached the app
one of the chars was corrupted. It's broken, but with this test you
can't be sure where the breakage is (input or output) because you've not
reported what the server thought the three characters were.
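
A rough way to report that (again assuming SQL Server and the
hypothetical "t"/"c" names; UNICODE() returns the code point of the
first character of its argument):

    my @cps = $dbh->selectrow_array(q{
        SELECT UNICODE(SUBSTRING(c, 1, 1)),
               UNICODE(SUBSTRING(c, 2, 1)),
               UNICODE(SUBSTRING(c, 3, 1))
        FROM t
    });
    printf "server code points: %s\n", join ",", map { sprintf "%x", $_ } @cps;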


>   database length: 3
>   data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
>   ords of output string: e2,201a,ac,
> 
> This is just the same as the first example and as such I cannot see
> the usefulness of it.

It's the same server state (3 chars) so the same output state.
Like I said, forget the input mechanism when talking about the output mechanism.

[Sorry to keep banging the same drum!]

> As it turns out I was about to release a 1.45 official release at
> the end of the 1.44 development cycle. I plan to release this any
> day now before considering any of this.

Perhaps add a warning message to the README and docs.

> My suggestion is that at the start of the 1.46_xx dev cycle:
> 
> o I apply the switch as described, advertise it widely, and hound
> everyone I know to try it out. As I see the current behaviour as
> badly broken, I just cannot see right now how anyone could have
> used it as it is and be adversely affected when it is fixed.
> 
> o add test cases for round tripping of unicode data to varchars (of
> which there are none right now as they all go to nvarchar, which
> already works as discussed).

"round tripping" will "just work" _if_ input works and output works.
But testing _only_ round tripping won't prove that the server has the
correct interpretation of what's been sent. It's prone to false
positives. Also a failure won't tell you whether the problem was on
the input side or the output side. So I suggest focusing on testing
input and output separately.
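
An input-only test is then just the mirror image of the output-only
sketch earlier: insert from Perl and ask the server what it stored,
without fetching the value back through the driver (same assumptions,
SQL Server and the hypothetical "t"/"c"):

    $dbh->do(q{INSERT INTO t (c) VALUES (?)}, undef, "\x{20ac}");
    my ($chars, $bytes, $cp) = $dbh->selectrow_array(
        q{SELECT LEN(c), DATALENGTH(c), UNICODE(c) FROM t}
    );
    # a single euro should show up as 1 character with code point 0x20ac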

> So, the only remaining issues are:
> 
> 1) my belief that binding output data as unicode/wide_chars always
> in the unicode build is the right one. If you see any problem in
> that it could turn things upside down but it would be useful to
> know.

What about people wanting to fetch binary data, like images?
Presumably "binding output data as unicode/wide_chars always"
depends on the data type on the server: varchar/nvarchar/blob/clob.
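
e.g. a quick sanity check would be fetching a varbinary/image column
and making sure it comes back as raw octets rather than wide chars
(hypothetical table "t" with a varbinary column "img"):

    my ($blob) = $dbh->selectrow_array(q{SELECT img FROM t});
    print DBI::data_string_desc($blob), "\n";   # expect "UTF8 off", length in octets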

> 2) if someone specifies a bind type on bind_param I follow that no
> matter what?

They'll get what they ask for :)
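
i.e. something like this should always be honoured (standard DBI
:sql_types constants):

    use DBI qw(:sql_types);
    $sth->bind_param(1, $value, SQL_WVARCHAR);   # force wide binding
    # or: $sth->bind_param(1, $value, { TYPE => SQL_VARCHAR });
    $sth->execute;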

> 3) SQLDescribeParam is not always supported (e.g. freeTDS) and even
> if it is, it can fail (because rearranging the SQL into a select
> which most drivers do for SQLDescribeParam can fail). In this case
> D:O has a fallback binding which you can override. I suggest that
> unless it is overridden D:O looks at the input param and binds it as
> unicode if it is unicode (SvUTF8), otherwise it binds it as
> SQL_CHAR. In other words, the result from the SQLDescribeParam call
> becomes irrelevant for char data as we look at the parameter data
> first.

Sounds good.
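
The suggested fallback, sketched at the Perl level (the real check
would be SvUTF8() in the XS code; utf8::is_utf8() is its Perl-space
equivalent):

    use DBI qw(:sql_types);
    my $type = utf8::is_utf8($param) ? SQL_WCHAR : SQL_CHAR;
    $sth->bind_param(1, $param, $type);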

> 4) check that D:O rebinds parameters per execute in case the
> parameter has changed - I'm fairly sure it already does this. I will
> build this into the test cases.
> 
> If I follow this, the changes in D:O are minimal and I've already
> made them here without any issues so far.
> 
> Opinions?
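
On (4), the rebinding check drops out of the same tests: prepare once
and execute with a plain byte string and then a UTF8-flagged string on
the same handle (hypothetical "t"/"c" again):

    my $sth = $dbh->prepare(q{INSERT INTO t (c) VALUES (?)});
    $sth->execute("plain ascii");   # SvUTF8 off
    $sth->execute("\x{20ac}");      # SvUTF8 on - the binding must be redone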

Did I mention to consider input and output separately?

:)

Tim.
