Re: Make Encode.pm support the real UTF-8

Bjoern Hoehrmann Thu, 02 Dec 2004 05:40:36 -0800

* Gisle Aas wrote:
>More interesting is:
>
>   decode("UTF8", "Bj\xEF\xBF\xBFrn")
>
>where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not
>legal Unicode.  Either the whole sequence "\xEF\xBF\xBF" is replaced
>by "\x{FFFD}", or each bad byte is replaced, giving us
>"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn".  I think the latter will be more sane,
>especially when you hit perl's 64-bit extension to UTF-8.


I think it should do whatever comes closest to the requirements or
suggestions in the Unicode Standard or RFC 3629; I am not sure what
that would be, though.
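
To make the two alternatives concrete, here is a minimal sketch. It
builds both candidate results by hand with a substitution (it is not a
decoder), and then prints whatever the installed Encode returns for the
same bytes. The variable names are only illustrative, and which
encoding name is treated strictly depends on the Encode version, so no
particular output is claimed for the loop.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte string from the example: "Bj", then EF BF BF, then "rn".
my $bytes = "Bj\xEF\xBF\xBFrn";

# Alternative 1: the whole three-byte sequence becomes one U+FFFD.
(my $whole = $bytes) =~ s/\xEF\xBF\xBF/\x{FFFD}/g;

# Alternative 2: each of the three bytes becomes its own U+FFFD.
(my $each = $bytes) =~ s/\xEF\xBF\xBF/\x{FFFD}\x{FFFD}\x{FFFD}/g;

# Render a string as a list of code points so the output stays ASCII.
sub codepoints {
    my ($str) = @_;
    return join " ", map { sprintf "U+%04X", ord } split //, $str;
}

print "whole sequence: ", codepoints($whole), "\n";
print "each byte:      ", codepoints($each),  "\n";

# What the installed Encode does for both spellings of the encoding
# name; the result is version-dependent, so it is only printed.
for my $name ("utf8", "UTF-8") {
    my $chars = decode($name, $bytes, Encode::FB_DEFAULT);
    printf "%-14s  %s\n", "$name:", codepoints($chars);
}
```

The first alternative yields a five-character string and the second a
seven-character one, which is the difference a caller would notice
when counting characters or computing offsets after decoding.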

>> Now that we have this problem, introducing more places where one needs
>> to carefully check the documentation for what is considered UTF-8 does
>> not seem like the best option. Having decode_utf8() and
>> decode(utf8=>...) mean something different is likely going to cause
>> confusion. Maybe this could go the other way round, i.e. introduce a
>> new encoding "UTF-8-Strict" or something.
>
>This is certainly more backwards compatible, but do we really want
>perl applications to exchange illegal UTF-8 by default?

Hmm, maybe I should ask why you proposed to keep the old behavior of
encode_utf8 in the first place? The change would make more sense to
me if both encode("UTF-8" => ...) and encode_utf8(...) were changed.
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
