On 10/05/2011 09:08, Martin J. Evans wrote:
On 09/05/11 22:06, Tim Bunce wrote:
On Mon, May 09, 2011 at 07:42:53PM +0100, Martin J. Evans wrote:
I've recently had an RT posted
(http://rt.cpan.org/Public/Bug/Display.html?id=67994) after a
discussion on stackoverflow
(http://stackoverflow.com/questions/5912082/automatic-character-encoding-handling-in-perl-dbi-dbdodbc).
In this case the Perl script is binding the columns but the data
returned is windows-1252, and the user is having to manually
Encode::decode every bound column. DBD::ODBC already has an
odbc_utf8_on flag
(http://search.cpan.org/~mjevans/DBD-ODBC-1.29/ODBC.pm#odbc_utf8_on)
for a derivative of Postgres which returns bound data UTF-8 encoded,
but in that case I can just call sv_utf8_decode (in the XS) and the
scalar is converted in place. Initially I thought I could fold
odbc_utf8_on into a new flag saying "my data is returned as xxx" and
just call Encode::decode with xxx (then eventually I could drop
odbc_utf8_on).
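For illustration, the workaround the user currently needs looks
something like this (a sketch - the column value is simulated here
rather than fetched, and windows-1252 is the charset from the RT):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulate what DBD::ODBC hands back for an SQL_CHAR column:
# raw windows-1252 bytes with no character semantics.
my $raw = "caf\xE9";    # "café" encoded in windows-1252

# What the user has to do by hand for every bound column today:
my $str = decode('cp1252', $raw);

printf "decoded: %s (%d characters)\n", $str, length $str;
```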
I'd be wary of going down this path. I sense pain just beyond the
horizon.
A twisty-turny maze of sharp edge cases and unforeseen issues.
So do I which is why I am still thinking about it and why I value any
other opinions/ideas.
For a start: What about the charset of bind values? What about the
charset of SQL literals?
Both of those are inputs, but this case is about data retrieved from
MS SQL Server, which I see as different. You can ensure your Perl data
is inserted correctly into a Unicode-aware database via ODBC by
ensuring your data is encoded as Unicode - DBD::ODBC will spot that
and bind the data as SQL_WCHAR, or pass the SQL to
SQLPrepareW/SQLExecDirectW (after converting it to UCS2). The opposite
is not true when retrieving data, unless you specifically bind all
char types as SQL_WCHAR, because that is not the default in DBD::ODBC
for varchar columns (just nvarchar columns). Perhaps you're thinking
now that this is where the flaw really is, but it is more difficult
than that. DBD::ODBC asks the database what the column type is: if it
says SQL_CHAR, the column is bound as SQL_CHAR (retrieving bytes in
whatever encoding the database and client libs decide), and if it says
SQL_WCHAR it is bound as SQL_WCHAR (and returned as UCS2, which
DBD::ODBC happily converts for you). In this guy's case the column is
identified as SQL_CHAR, it is bound as SQL_CHAR, and the returned data
is windows-1252 - DBD::ODBC knows nothing about it. I appreciate his
problem to a degree, as it is a pain to have to decode every bound
column.
Of course, I could change DBD::ODBC, when built in unicode mode, to
bind all SQL_CHAR columns as SQL_WCHAR, and that would also solve his
problem, but a) it would be more expensive for those who don't need to
do that, and b) I'd worry it might break existing code, e.g. anyone
who, like him, is already decoding the data themselves.
Can't the database connection/session settings be altered to assume utf8
at the client end and have the server or client libs automatically
convert for you? If so, that's a good way to go.
Tim.
Perhaps, although at this stage I don't know how, and in any case even
if the data could be returned UTF-8 encoded I would need to know it is
UTF-8 encoded to do anything with it. The additional problem is that
UTF-8 encoded data may be longer than the size of the column as
reported to DBD::ODBC, so it would need to know this in advance and
multiply all buffers by 4 (UTF-8 needs at most 4 bytes per Unicode
character, though the original UTF-8 definition went as far as 6 bytes
for code points Unicode no longer reaches).
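The worst-case expansion is easy to check from Perl (a standalone
sketch, not DBD::ODBC code):

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 byte lengths for progressively higher code points:
for my $cp (0x41, 0xE9, 0x20AC, 0x10400) {
    my $bytes = encode('UTF-8', chr $cp);
    printf "U+%04X -> %d byte(s)\n", $cp, length $bytes;
}
# So a column reported as N characters can need up to 4*N bytes
# when fetched UTF-8 encoded.
```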
Martin
Thinking about this more today, and discussing it with some of my
colleagues, I think I've come to a conclusion.
The points that sway me are:
o by default DBD::ODBC on Windows is built to support the wide ODBC
API, which supports Unicode (not every Unicode character, since it is
UCS2, but most).
o if you are using the unicode build of DBD::ODBC you probably want to
support unicode data.
o you are obviously doing something with any data you retrieve from
the database, and even if the encoding (or charset) you retrieve is
not what you finally need (as for the person who raised the RT
originally), you can easily add an encoding layer to stdout or to
whatever file etc. you output to.
o returning the data in some native codepage like windows-1252 does
not really help you - it probably hinders you - and in any case you
can encode it as windows-1252 afterwards when you output it somewhere.
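The encoding layer mentioned in the third point is a one-liner; a
sketch assuming you have decoded (character) data and want
windows-1252 bytes on output:

```perl
use strict;
use warnings;

# Data as Perl Unicode characters (what SQL_WCHAR binding gives you):
my $str = "caf\x{E9}";

# Push an encoding layer onto the handle you write to; the same works
# for any output filehandle, not just STDOUT.
binmode STDOUT, ':encoding(cp1252)';
print $str, "\n";
```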
As a result, I think it was probably an omission, when DBD::ODBC was
changed to support Unicode, that it continued to retrieve varchar
columns as SQL_CHAR; it should retrieve them as SQL_WCHAR (so long as
the driver supports this - there is a SQLGetInfo call to check). That
way the data retrieved into a Perl script should always be accurate,
and can easily be translated to other charsets and encodings if that
is what you need. The hopefully slight downside is for anyone running
code against DBD::ODBC as it stands who, like the person who started
this on RT, knows the data is coming back as windows-1252 (or
whatever) and decodes it themselves: they will now get the wrong data,
as it will already be Unicode characters.
I propose to change DBD::ODBC to always bind varchar etc. columns as
SQL_WCHAR, and hence return Unicode data in the unicode build, but to
add a connection attribute restoring the previous behaviour. I could
have done it the other way around, but this default seems more logical
to me.
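To make that concrete, usage might look like this (a sketch: the
attribute name odbc_old_unicode is only a placeholder until I pick
one, and the DSN/credentials are illustrative):

```perl
use strict;
use warnings;
use DBI;

my ($user, $pass) = ('me', 'secret');    # illustrative credentials

# Proposed default in a unicode build: varchar columns bound as
# SQL_WCHAR, so fetched data arrives as Perl Unicode strings.
my $dbh = DBI->connect('dbi:ODBC:DSN=mydsn', $user, $pass);

# Opting back into the old behaviour (SQL_CHAR binding, raw bytes)
# via a connection attribute - the name is a placeholder:
my $dbh_old = DBI->connect('dbi:ODBC:DSN=mydsn', $user, $pass,
                           { odbc_old_unicode => 1 });
```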
If you have any comments before I make this change I'd be happy to hear
them.
Martin