Nice one Mike spot on!

I did a bit of digging myself before trying anything and found this blog : 
http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-server-and-utf-8-encoding-1-true-or-false.aspx

You learn something new every day!

It seems MS encoding is 'UCS-2' which is equivalent to 'UTF-16' (sort of, but 
not exact!)

Perhaps Win32::ODBC handles this natively whereas DBI::DBD::ODBC doesn't?

Having read the UCS-2 info at the bottom of the DBD::ODBC CPAN docs it seems 
DBD::ODBC is messy as hell trying to support UCS-2 / UTF-16.

I have tried within Perl to use decode against the string from the DBI fetch, 
but it just errors...

[error] Caught exception in My_App::Controller:: XLSExport->export "Wide 
character in subroutine entry at C:/Perl/lib/Encode.pm line 176."

And that was using UTF-16 and UCS-2 / UCS-2BE encoding types, is seems as per 
the encode docs "Bogus surrogates result in death."

And I guess as utf8::is_utf8 comes back as true and the fact it states on the 
DBD::ODBC docs...

" Perl scalars which are UTF-8 and are sent through the ODBC API will be 
converted to UTF-16 and passed to the ODBC wide APIs or signalled as SQL_WCHARs 
(e.g., in the case of bound columns). Retrieved data which are wide characters 
are converted from UTF-16 to UTF-8. However, you should realise most ODBC 
drivers do not support UTF-16"

It seems the DBD::ODBC driver is auto converting the data from UTF-16 to UTF-8, 
and that is where the corruption must be occurring, it even states that UTF-16 
isn't really supported under ODBC?

So what I have is UTF-8, but after a conversion by the ODBC driver from UCS-2 
using UTF-16 encoding, which appears to be making a mess of things.

I have checked and I was running DBI v 1.616, so installed latest version 1.627 
plus upgraded DBD::ODBC to 1.43 but exactly the same problem!

I'm at a loss what else I can do other than throw DBI away and refactor to 
Win32::ODBC , it seems DBI::DBD::ODBC is converting the data on retrieval into 
UTF-8 and making a pigs ear of it! I can't see any simple switch or parameter 
to tell DBD::ODBC to do things differently, can you?

There is a whole bunch of Unicode maybe's and caveats on the DBD::ODBC CPAN 
docs, yet if you scan the Win32::ODBC CPAN docs for a mention of UTF-8 / UTF-16 
/ UCS-2 or the word Unicode, none of them are mentioned even once, so is that a 
significant indicator of something?

I've checked with DBD::ODBC and Win32::ODBC , that the ODBC.pm is not the same 
module being used, I can see a flag in the DBI ODBC.pm module of 
"odbc_has_unicode", but it is an read only  attribute.

Can you suggest anything else that might prevent DBI from corrupting the data 
on UTF-8 conversion?

Thanks,
Craig.








-----Original Message-----
From: Mike Whitaker [mailto:m...@altrion.org]
Sent: 04 July 2013 10:35
To: The elegant MVC web framework
Subject: Re: [Catalyst] CSV / UTF-8 / Unicode

On 4 Jul 2013, at 10:30, Mike Whitaker <m...@altrion.org> wrote:
> On 4 Jul 2013, at 09:56, Craig Chant <cr...@homeloanpartnership.com> wrote:
>>
>> Yes it's NVARCHAR(max) , which I understood is MS's data-typing for uNicode 
>> VARiable CHARacters, looking at some sample column data via the Windows SQL 
>> Management GUI, it appears to display ok.
>
> It probably isn't UTF-8, though. UTF-8 is only one possible encoding in which 
> you can store unicode character points.

Try, on the off-chance I've read the spec right

- not bothering with mysql_encode_utf8 in the DBI connect args
- passing all data from DBI through decode("UTF-16",...) or decode("UCS-2"...) 
- MS's docs aren't that clear which!
- reencoding it as utf8 on the way out.



_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
This Email and any attachments contain confidential information and is intended 
solely for the individual to whom it is addressed. If this Email has been 
misdirected, please notify the author as soon as possible. If you are not the 
intended recipient you must not disclose, distribute, copy, print or rely on 
any of the information contained, and all copies must be deleted immediately. 
Whilst we take reasonable steps to try to identify any software viruses, any 
attachments to this e-mail may nevertheless contain viruses, which our 
anti-virus software has failed to identify. You should therefore carry out your 
own anti-virus checks before opening any documents. HomeLoan Partnership will 
not accept any liability for damage caused by computer viruses emanating from 
any attachment or other document supplied with this e-mail. HomeLoan 
Partnership reserves the right to monitor and archive all e-mail communications 
through its network. No representative or employee of HomeLoan Partnership has 
the authority to enter into any contract on behalf of HomeLoan Partnership by 
email. HomeLoan Partnership is a trading name of H L Partnership Limited, 
registered in England and Wales with Registration Number 5011722. Registered 
office: 26-34 Old Street, London, EC1V 9QQ. H L Partnership Limited is 
authorised and regulated by the Financial Conduct Authority.

_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/

Reply via email to