On 10/05/2011 09:08, Martin J. Evans wrote:
On 09/05/11 22:06, Tim Bunce wrote:
On Mon, May 09, 2011 at 07:42:53PM +0100, Martin J. Evans wrote:
I've recently had an RT posted
(http://rt.cpan.org/Public/Bug/Display.html?id=67994) after a
discussion on stackoverflow
(http://stackoverflow.com/questions/5912082/automatic-character-encoding-handling-in-perl-dbi-dbdodbc).
In this case the Perl script is binding the columns but the data
returned is windows-1252, and the user is having to manually
Encode::decode every bound column. DBD::ODBC already has an
odbc_utf8_on flag
(http://search.cpan.org/~mjevans/DBD-ODBC-1.29/ODBC.pm#odbc_utf8_on)
for a derivative of Postgres which returns bound data UTF-8 encoded,
but in that case I can just call sv_utf8_decode (in the XS) and the
scalar is converted in place. Initially I thought I could fold
odbc_utf8_on into a new flag saying "my data is returned as xxx" and
just call Encode::decode with xxx (then eventually I could drop
odbc_utf8_on).
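For illustration, the workaround the user currently needs looks
something like this (a sketch - the column value is simulated here
rather than fetched, and windows-1252 is the charset from the RT):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulate what DBD::ODBC hands back for an SQL_CHAR column:
# raw windows-1252 bytes with no character semantics.
my $raw = "caf\xE9";    # "café" encoded in windows-1252

# What the user has to do by hand for every bound column today:
my $str = decode('cp1252', $raw);

printf "decoded: %s (%d characters)\n", $str, length $str;
```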
I'd be wary of going down this path. I sense pain just beyond the
horizon.
A twisty-turny maze of sharp edge cases and unforeseen issues.
So do I which is why I am still thinking about it and why I value any
other opinions/ideas.
For a start: What about the charset of bind values? What about the
charset of SQL literals?
Both of those are inputs, but this case is about data retrieved from
MS SQL Server, which I see as different. You can ensure your Perl data
is inserted correctly into a Unicode-aware database via ODBC by
ensuring your data is encoded as Unicode - DBD::ODBC will spot that
and bind the data as SQL_WCHAR, or pass the SQL to
SQLPrepareW/SQLExecDirectW (after converting it to UCS2). The opposite
is not true when retrieving data, unless you specifically bind all
char types as SQL_WCHAR, because that is not the default in DBD::ODBC
for varchar columns (just nvarchar columns). Perhaps you're thinking
now that this is where the flaw really is, but it is more difficult
than that. DBD::ODBC asks the database what the column type is: if it
says SQL_CHAR, the column is bound as SQL_CHAR (retrieving bytes in
whatever encoding the database and client libs decide), and if it says
SQL_WCHAR it is bound as SQL_WCHAR (and returned as UCS2, which
DBD::ODBC happily converts for you). In this guy's case the column is
identified as SQL_CHAR, it is bound as SQL_CHAR, and the returned data
is windows-1252 - DBD::ODBC knows nothing about it. I appreciate his
problem to a degree, as it is a pain to have to decode every bound
column.
Of course, I could change DBD::ODBC, when built in unicode mode, to
bind all SQL_CHAR columns as SQL_WCHAR, and that would also solve his
problem, but a) it would be more expensive for those who don't need to
do that, and b) I'd worry it might break existing code, e.g. anyone
who, like him, is already decoding the data themselves.
Can't the database connection/session settings be altered to assume utf8
at the client end and have the server or client libs automatically
convert for you? If so, that's a good way to go.
Tim.
Perhaps, although at this stage I don't know how, and in any case even
if the data could be returned UTF-8 encoded I would need to know it is
UTF-8 encoded to do anything with it. The additional problem is that
UTF-8 encoded data may be longer than the size of the column as
reported to DBD::ODBC, so it would need to know this in advance and
multiply all buffers by 4 (UTF-8 needs at most 4 bytes per Unicode
character, though the original UTF-8 definition went as far as 6 bytes
for code points Unicode no longer reaches).
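The worst-case expansion is easy to check from Perl (a standalone
sketch, not DBD::ODBC code):

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 byte lengths for progressively higher code points:
for my $cp (0x41, 0xE9, 0x20AC, 0x10400) {
    my $bytes = encode('UTF-8', chr $cp);
    printf "U+%04X -> %d byte(s)\n", $cp, length $bytes;
}
# So a column reported as N characters can need up to 4*N bytes
# when fetched UTF-8 encoded.
```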
Martin
Thinking about this more today, and discussing it with some of my
colleagues, I think I've come to a conclusion.
The points that sway me are:
o by default DBD::ODBC on Windows is built to support the wide ODBC
API, which supports Unicode (not every Unicode character, since it is
UCS2, but most).
o if you are using the unicode build of DBD::ODBC you probably want to
support unicode data.
o you are obviously doing something with any data you retrieve from
the database, and even if the encoding (or charset) you retrieve is
not what you finally need (as for the person who raised the RT
originally), you can easily add an encoding layer to stdout or to
whatever file etc. you output to.
o returning the data in some native codepage like windows-1252 does
not really help you - it probably hinders you - and in any case you
can encode it as windows-1252 afterwards when you output it somewhere.
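The encoding layer mentioned in the third point is a one-liner; a
sketch assuming you have decoded (character) data and want
windows-1252 bytes on output:

```perl
use strict;
use warnings;

# Data as Perl Unicode characters (what SQL_WCHAR binding gives you):
my $str = "caf\x{E9}";

# Push an encoding layer onto the handle you write to; the same works
# for any output filehandle, not just STDOUT.
binmode STDOUT, ':encoding(cp1252)';
print $str, "\n";
```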
As a result, I think it was probably an omission, when DBD::ODBC was
changed to support Unicode, that it continued to retrieve varchar
columns as SQL_CHAR; it should retrieve them as SQL_WCHAR (so long as
the driver supports this - there is a SQLGetInfo call to check). That
way the data retrieved into a Perl script should always be accurate,
and can easily be translated to other charsets and encodings if that
is what you need. The hopefully slight downside is for anyone running
code against DBD::ODBC as it stands who, like the person who started
this on RT, knows the data is coming back as windows-1252 (or
whatever) and decodes it themselves: they will now get the wrong data,
as it will already be Unicode characters.
I propose to change DBD::ODBC to always bind varchar etc. columns as
SQL_WCHAR, and hence return Unicode data in the unicode build, but to
add a connection attribute restoring the previous behaviour. I could
have done it the other way around, but this default seems more logical
to me.
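To make that concrete, usage might look like this (a sketch: the
attribute name odbc_old_unicode is only a placeholder until I pick
one, and the DSN/credentials are illustrative):

```perl
use strict;
use warnings;
use DBI;

my ($user, $pass) = ('me', 'secret');    # illustrative credentials

# Proposed default in a unicode build: varchar columns bound as
# SQL_WCHAR, so fetched data arrives as Perl Unicode strings.
my $dbh = DBI->connect('dbi:ODBC:DSN=mydsn', $user, $pass);

# Opting back into the old behaviour (SQL_CHAR binding, raw bytes)
# via a connection attribute - the name is a placeholder:
my $dbh_old = DBI->connect('dbi:ODBC:DSN=mydsn', $user, $pass,
                           { odbc_old_unicode => 1 });
```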
If you have any comments before I make this change I'd be happy to hear
them.
Martin