On Thu, Oct 24, 2013 at 12:00:27AM +0100, Martin J. Evans wrote:

> I may interchange bytes with chrs in a codepage in this.
> A codepage has 255 characters and each one is a byte.

I don't understand that first sentence.

> Before unicode support was added to DBD::ODBC, it used to bind all
> char and varchar columns as SQL_CHAR (and hence not unicode BUT
> could be a character in a windows code page which has a unique
> unicode codepoint). This meant you could store chrs in your current
> code page and it also meant you could just ignore code pages and
> store, say UTF-8 encoded data in a char/varchar (and some did and
> decoded it themselves when they read it back).

There's a whole bunch of stuff in there that I'm not clear on.

By "your current code page" I presume you mean the client current code
page, not the server's idea of what the client code page is.

What I think you're saying here is that client code could store a
sequence of bytes on the server and get back the same bytes later,
*but* the server may interpret that sequence of bytes differently
from the way the client does.

I find it really helpful to be very clear about how each part of
a system views the bytes that are passing through it:

    - terminal
    - application reading input
    - application internal logic
    - DBD
    - DB client library
    - DB server
    ---later---
    - DB server
    - DB client library
    - DBD
    - application internal logic
    - etc. etc.

There are many actors here. If they have different views of what the
byte stream represents then you can have problems, but those problems
might only appear in certain cases (like surrogate characters).
Also, different views at different levels can cancel each other out,
which makes the problems hard to detect.

It's quite possible to have correct unicode input at a terminal that's
then processed by some code and output as correct unicode, even though
the code doing the processing never views the bytes as unicode.
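
A contrived sketch of what I mean, assuming a UTF-8 terminal and locale
and no decoding layer on STDIN (an illustration, not a recipe):

    my $line = <STDIN>;   # UTF-8 bytes arrive; perl does *not* decode them
    chomp $line;
    # perl's view: length() is counting bytes here, not characters
    printf "perl sees %d 'characters'\n", length($line);
    # the terminal's view: it decodes the UTF-8 itself, so this looks right
    print "$line\n";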

The DBD::Oracle test suite goes to great lengths to check that the
server's view of what it's been sent matches what we want.
Specifically, when a single unicode character that encodes to multiple
bytes reaches the server, the server should agree that it's only a
single character.
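
From memory it's something along these lines (the connect details here
are made up, and I'm assuming an AL32UTF8 database; Oracle's LENGTH()
counts characters and LENGTHB() counts bytes):

    use DBI;
    my $dbh  = DBI->connect('dbi:Oracle:ORCL', 'user', 'pass');
    my $euro = "\x{20AC}";   # one perl character, three bytes as UTF-8
    my ($chars, $bytes) = $dbh->selectrow_array(
        'SELECT LENGTH(?), LENGTHB(?) FROM dual', undef, $euro, $euro);
    # if the server received a single character then $chars == 1,
    # and $bytes == 3 if the database character set is AL32UTF8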

Forgive me if all this is familiar to you. I repeat it here partly for
others and partly because it wasn't clear to me from how you'd expressed
the above.

> When I discovered if you bound char/varchar columns as SQL_WCHAR
> (unicode) (with a unicode build of DBD::ODBC) you got unicode back
> (in other words sql server converts chrs in the current codepage to
> unicode) it seemed like a good idea to change the default to bind as
> SQL_WCHAR. With hindsight that was a mistake. Because this would
> break people storing UTF-8 encoded data in char/varchar I issued a
> change warning, eventually changed to SQL_WCHAR and added an
> odbc_old_unicode attribute to return it to old behaviour. No one has
> complained so far.

I read this and I'm left wondering about input vs storage vs output.
It's not clear to me what the server's view is of what it's been sent.
What would selectrow_array("SELECT CHARACTER_LENGTH(?) ...", undef, $foo)
return?
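
Something like this, say (SQL Server spells these LEN() and DATALENGTH()
rather than the ANSI CHARACTER_LENGTH(), and of course how the parameter
gets bound is exactly the thing in question, so treat it as a probe
rather than the final word):

    # LEN() counts characters as the server sees them, DATALENGTH() bytes
    my ($len, $octets) = $dbh->selectrow_array(
        'SELECT LEN(?), DATALENGTH(?)', undef, $foo, $foo);
    # if $foo was the euro sign and arrived intact:   $len == 1
    # if it arrived as three code page characters:    $len == 3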

Or, to look at it another way, if the value is inserted into a table
and the table is sorted, is it sorted correctly? (That's a more complex
question though because it involves collations and possibly per-table
or even per-column character sets - so even more actors in the plot!)

> As it turns out this was probably a bad idea and breaks things.
> 
> The problem I've created is horrible.
> 
> Let's say you want to store chrs in the current code page in a char
> or varchar or you want to store data already UTF-8 encoded in a
> char/varchar (i.e. bytes) and decode it when you read it back.

Argh. Too complicated! :)

> Let's say the data you want to store is either:
> 
> codepage chrs: 0xe2, 0x82, 0xac
> UTF-8 encoded euro symbol: 0xe2, 0x82, 0xac

What does perl's length() function report for those?
It's really important to focus on surface representation
when trying to work out what the right thing to do is.
I presume the perl length() of the first string is 3
and the second is 1.  Right?
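
In other words, something like this (using the byte values from your
example):

    my $codepage_str = "\xe2\x82\xac";   # three separate characters to perl
    my $euro_char    = "\x{20ac}";       # what those bytes decode to as UTF-8
    print length($codepage_str), "\n";   # 3
    print length($euro_char), "\n";      # 1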

> When you insert it, DBD::ODBC calls SQLDescribeParam and the driver
> describes these parameters as SQL_CHAR, they are bound as SQL_CHAR,
> so SQL Server takes this sequence to be characters in the current
> (or table or column) codepage.

Are you saying that DBD::ODBC doesn't pay attention to the SvUTF8
flag on the value?
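
(For anyone following along at home, the flag is visible from perl code
too; a tiny sketch:)

    my $bytes = "\xe2\x82\xac";   # three bytes, SvUTF8 off
    my $chars = "\x{20ac}";       # one character, SvUTF8 on
    printf "bytes string: SvUTF8 %s\n", utf8::is_utf8($bytes) ? "on" : "off";
    printf "chars string: SvUTF8 %s\n", utf8::is_utf8($chars) ? "on" : "off";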

> If you select them back binding them
> as SQL_CHAR, you get back what you put in - excellent. Except,
> DBD::ODBC now (and has for a long time) defaulted in a unicode build
> of DBD::ODBC to binding them as SQL_WCHAR. When you read them back
> you get:
> 
> 0xe2,0x201a,0xac (assuming windows-1252 is the codepage)
> 
> (because 0x82 in windows-1252 is a funny comma which is really U+201a)

What did the server think the string was?

> and the column data is marked in perl as unicode.

So a unicode build of DBD::ODBC always sets the SvUTF8 flag?

> argh.
> 
> I don't really care about the people trying to insert unicode into
> char/varchars as it is just a daft thing to do (although possible)
> and my document explains why but I'm amazed this has not caught
> someone out on Windows (where the default for DBD::ODBC is a unicode
> build).

I wouldn't call it daft. People have database tables with varchar
fields and perl strings they want to store in them. They expect it to
just work, and keep on "just working" when they start processing
unicode - which might happen accidentally.

> So now I'm not sure what to do. It seems the odbc_old_unicode
> behaviour was right in the first place and although it is easy to
> fix how do I do it reasonably? Can I assume anyone who got broken
> when the original change was made, switched to setting
> odbc_old_unicode so reverting to old behaviour won't affect them?
> But then, what about people not using odbc_old_unicode and relying
> on it. You could say these people still get back what they inserted,
> it is just they inserted data in a codepage and got the same string
> back, just in unicode now. They will be the ones affected because
> the data they would get back after reverting the change will be
> bytes/chrs in the codepage now and no longer marked as unicode in
> their perl.
> 
> Any good suggestions to get me out of this mess?

Let's start by separating the discussion of input from output.

I suggest you rewrite the message and focus just on input for now.

Tim [who knows ~zero about DBD::ODBC].
