Erland Sommarskog <[EMAIL PROTECTED]> writes:
>Jean-Michel Hiver ([EMAIL PROTECTED]) writes:
>> Erland Sommarskog wrote:
>>> I'm working with an XS module that passes queries to MS SQL Server and
>>> returns data back using SQLOLEDB. MS SQL Server stores Unicode data
>>> as UTF-16. Also, all metadata is UTF-16.
>>>
>>> Currently when I get Unicode data back from SQL Server, I convert it to
>>> UTF-8, stash it in an SV, and then set the UTF-8 flag, without checking
>>> whether this is really necessary.
That should be okay. A reasonably cheap option is to convert to UTF-8 as
above, then scan to see whether any bytes have the high bit set, and call
SvUTF8_on only if they do. That way pure ASCII isn't "penalized" by having
the UTF-8 flag set.

Converting to iso-8859-1 is the alternative, but note that NOT setting the
UTF-8 flag on high characters (even if they are representable) sadly
affects the semantics. Unless "locale" is used (which is a bit alien to
Win32), 'Ñ' (N with tilde) etc. are not treated as alphabetic, because
perl defaults to the C locale.

Note too that the normal Windows "latin 1" code page is a superset of
iso-8859-1, so converting to that is wrong: it encodes the Euro sign,
smart quotes, the em dash etc. into places (0x80..0x9F) that are not what
perl expects.

>> Personally I try to use Encode as much as possible, which does The Right
>> Thing for me.
>>
>> $string = Encode::decode ('utf-16', $octets); is pretty safe.

As far as I recall, Encode::decode leaves the SvUTF8 flag on once it has
done its thing. But Dan may have cleaned that up.

>> Regarding speed, Encode seems pretty fast to me - but YMMV I guess.
>
> Alright, I failed to say that this is an XS module, so I convert with
> WideCharToMultiByte, a Windows routine(*), put the result in an SV, and
> then say SvUTF8_on.

The possible danger here is if the "multi byte" encoding for the user's
environment is not UTF-8 but (say) a Japanese one. Using Encode avoids
that.

> (*) SQLOLEDB is available on Windows only, so portability is not an issue.