On 10/05/2011 09:08, Martin J. Evans wrote:
On 09/05/11 22:06, Tim Bunce wrote:
On Mon, May 09, 2011 at 07:42:53PM +0100, Martin J. Evans wrote:
I've recently had an rt posted
(http://rt.cpan.org/Public/Bug/Display.html?id=67994) after a
discussion on stackoverflow (http://stackoverflow.com/questions/5912082/automatic-character-encoding-handling-in-perl-dbi-dbdodbc).

In this case the Perl script is binding the columns, but the data
returned is windows-1252 and the user is having to manually
Encode::decode all bound columns. DBD::ODBC already had an
odbc_utf8_on flag
(http://search.cpan.org/~mjevans/DBD-ODBC-1.29/ODBC.pm#odbc_utf8_on)
for a derivative of Postgres which returns bound data UTF-8 encoded,
but in that case I can just call sv_utf8_decode (in the XS) and the
data is converted in place. Initially I thought I could combine
odbc_utf8_on into a new flag saying "my data is returned as xxx" and
just call Encode::decode with xxx (then eventually I could drop
odbc_utf8_on).
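
For context, this is roughly how odbc_utf8_on is used today (a
minimal sketch; the DSN and credentials are placeholders):

    use DBI;

    # odbc_utf8_on tells DBD::ODBC that bound char data comes back
    # already UTF-8 encoded, so it can flip Perl's UTF-8 flag in place.
    my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass',
                           { odbc_utf8_on => 1 });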

I'd be wary of going down this path. I sense pain just beyond the horizon.
A twisty-turny maze of sharp edge cases and unforeseen issues.

So do I, which is why I am still thinking about it and why I value any other opinions/ideas.

For a start: What about the charset of bind values?  What about the
charset of SQL literals?

Both of those are inputs, but this case is data retrieved from MS SQL Server, which I see as different. You can ensure your Perl data is inserted into a Unicode-aware database via ODBC by ensuring your data is encoded as Unicode - DBD::ODBC will spot that and bind the data as SQL_WCHAR, or pass the SQL to SQLPrepareW/SQLExecDirectW (after converting it to UCS2). The opposite is not true for retrieving data unless you specifically bind all char types as SQL_WCHAR, which is not the default in DBD::ODBC for varchar columns (just nvarchar columns).

Perhaps you're thinking this is where the flaw really is, but it is more difficult than that. DBD::ODBC asks the database what the column type is: if it says SQL_CHAR, the column is bound as SQL_CHAR (retrieving bytes in whatever encoding the database and client libs decide), and if it says SQL_WCHAR, it is bound as SQL_WCHAR (and returned as UCS2, which DBD::ODBC happily converts for you). In this guy's case the column is identified as SQL_CHAR, it is bound as SQL_CHAR, and the returned data is windows-1252 - and DBD::ODBC knows nothing about it. I appreciate his problem to a degree, as it is a pain to have to decode every bound column.
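
To make that concrete, this is roughly the workaround he is stuck
with for every bound column (a sketch; the DSN, table and column
names are made up):

    use DBI;
    use Encode;

    my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass');
    my $sth = $dbh->prepare('SELECT name FROM some_table');
    $sth->execute;
    while (my ($name) = $sth->fetchrow_array) {
        # The column is reported (and bound) as SQL_CHAR, so $name
        # holds raw windows-1252 octets; decode them into Perl
        # characters before doing anything else with them.
        $name = Encode::decode('windows-1252', $name);
        # ... use $name as character data from here on ...
    }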

Of course, I could change DBD::ODBC when built in Unicode mode to bind all SQL_CHAR columns as SQL_WCHAR, and that would also solve his problem, but a) it would be more expensive for those who don't need to do that, and b) I'd worry it might break existing code, e.g. anyone like him who is already decoding the data themselves.

Can't the database connection/session settings be altered to assume utf8
at the client end and have the server or client libs automatically
convert for you? If so, that's a good way to go.

Tim.

Perhaps, although at this stage I don't know how, and in any case even if the data could be returned UTF-8 encoded I would need to know it is UTF-8 encoded to do anything with it. The additional problem is that UTF-8 encoded data may be longer than the size of the column as reported to DBD::ODBC, so it would need to know this in advance and multiply all buffers by 4 or perhaps even 6 (can't remember right now if Unicode goes that far).
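
For what it's worth, UTF-8 tops out at 4 octets per character for
Unicode proper (up to U+10FFFF); the 5- and 6-octet forms in the
original spec covered 31-bit code points Unicode no longer uses.
Easy to check from Perl:

    use Encode;

    # A supplementary-plane character (U+10000 and up) takes the
    # UTF-8 maximum of 4 octets.
    my $octets = Encode::encode('UTF-8', "\x{10000}");
    print length($octets), "\n";    # prints 4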

Martin

Thinking about this more today and discussing it with some of my colleagues, I think I've come to a conclusion.

The points that sway me are:

o by default DBD::ODBC on Windows is built to support the Wide ODBC API, which supports Unicode (not every Unicode character, because it is UCS2, but most).

o if you are using the Unicode build of DBD::ODBC, you probably want to support Unicode data.

o you are obviously doing something with any data you retrieve from the database, and even if the encoding (or charset) you retrieve from the database is not what you finally need (as with the person who raised the rt originally), you can easily add an encoding layer to stdout or to whatever file etc. you output to - see the sketch after this list.

o returning the data in some native codepage like windows-1252 does not really help you - it probably hinders you - and in any case you can encode the data as windows-1252 afterwards, when you output it somewhere.
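
A minimal sketch of that last point, assuming the columns already
arrive as Perl characters (the DSN, table and target encoding are
placeholders):

    use DBI;

    # Re-encode everything printed to stdout as windows-1252 (or
    # UTF-8, etc.) via a PerlIO layer, rather than fixing up each
    # column by hand.
    binmode STDOUT, ':encoding(windows-1252)';

    my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass');
    my $sth = $dbh->prepare('SELECT name FROM some_table');
    $sth->execute;
    while (my ($name) = $sth->fetchrow_array) {
        print "$name\n";    # characters re-encoded on the way out
    }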

As a result, I think it was probably an omission, when DBD::ODBC was changed to support Unicode, that it continued to retrieve varchar columns as SQL_CHAR; it should retrieve them as SQL_WCHAR (so long as the driver supports this - there is a SQLGetInfo call to check). In this way the data retrieved into a Perl script should always be accurate, and can easily be translated to other charsets and encodings if that is what you need. The hopefully slight downside is for anyone running code with DBD::ODBC as it stands who, like the person who started this on rt, knows the data is coming back as windows-1252 (or whatever) and attempts to decode it: they will now get the wrong data, as it is already Unicode characters.

I propose to change DBD::ODBC to always bind varchar etc. columns as SQL_WCHAR, and hence get Unicode data in the Unicode build, but add a connection attribute to return to the previous behaviour. I could have done it the other way around, but this default seems more logical to me.
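
So the opt-out would look something like this - to be clear, the
attribute name here is a placeholder I've made up for illustration,
not the final name:

    use DBI;

    # 'odbc_old_unicode_binding' is a hypothetical placeholder for the
    # proposed opt-out attribute; set it to get the old SQL_CHAR binding.
    my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass',
                           { odbc_old_unicode_binding => 1 });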

If you have any comments before I make this change I'd be happy to hear them.

Martin
