* Gisle Aas wrote:
>As you probably know perl's version of UTF-8 is not the real thing. I
>thought I would hack up a patch to support the encoding as defined by
>Unicode. That involves rejecting illegal chars (like surrogates,
>"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
>and such.

I would very much like to have this functionality available in some
standard module. What exactly do you mean by rejecting, though? For
example, by default I would expect decode("UTF-8" => "Bj\xF6rn") to
return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would this
change (i.e., would it croak instead)?

>Before I do this I would like to get some feedback on the interface.
>My preferred interface would be to make:
>
>  encode("UTF-8", $string)
>
>imply the official restricted form and then have
>
>  encode("UTF-8-Perl", $string)
>
>be used as the name for Perl's relaxed and extended version of the
>encoding. The encode_utf8($string) function would continue to be the
>same as encode("UTF-8-Perl", $string).

I would prefer there was no semantic overloading of "UTF-8" at all. I
generally expect that anything called UTF-8 refers to UTF-8 as defined
in the Unicode standard or RFC 3629. I was, for example, surprised that
Encode::is_utf8(...) considers sequences UTF-8 that are not UTF-8 as
defined in those specifications (the documentation explicitly states
"well-formed UTF-8").

Now that we have this problem, introducing more places where one needs
to carefully check the documentation to find out what is considered
UTF-8 does not seem like the best option; having decode_utf8() and
decode(utf8=>...) mean something different is likely to cause
confusion. Maybe this could go the other way round, i.e. introduce a
new encoding "UTF-8-Strict" or something.

>This implies that encode("UTF-8", $string) can start failing while
>previously it could not.

As above, by default I do not think it should fail but rather use a
replacement character instead of croaking. The result should be the
same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)

  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

where decode("RFC-3629-UTF-8") would always return an RFC-3629-UTF-8
string with no illegal sequences (and as that should not fail, the
above should not fail either). I.e.

  encode("RFC-3629-UTF-8" => $string) eq
  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

would always hold true (assuming that decode("RFC-3629-UTF-8") would
ignore that the UTF-8 flag on $string is already set and decode
"again").

>Other suggestions or comments?

There should be a corresponding is_foo function that checks whether a
sequence of octets (or a string with the UTF-8 flag set) is actually
UTF-8 as defined in the relevant specifications, maybe by adding one
more argument to Encode::is_utf8 like

  Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)

-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
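
P.S. To make the question about default behaviour concrete, here is a minimal sketch of the two outcomes discussed above for the "Bj\xF6rn" example: substitution with U+FFFD versus croaking. It uses only Encode's existing decode() together with the FB_CROAK and LEAVE_SRC check flags. Whether the name "UTF-8" ends up strict or relaxed is exactly what is under discussion, so the behaviour for other inputs (surrogates, noncharacters, overlong forms) should be treated as an assumption and is not shown.

```perl
use strict;
use warnings;
use Encode ();

# "Bj\xF6rn" is "Björn" in Latin-1; the lone octet 0xF6 is not valid UTF-8.
my $octets = "Bj\xF6rn";

# Default (no CHECK argument): the malformed octet is replaced by U+FFFD.
my $lenient = Encode::decode("UTF-8", $octets);

# With FB_CROAK the same input dies instead of substituting.
# LEAVE_SRC keeps decode() from modifying $octets in place.
my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $strict = eval { Encode::decode("UTF-8", $octets, $check) };
my $croaked = defined $strict ? 0 : 1;

# %vX prints the code point of each character in hex, joined by dots,
# which avoids "Wide character" warnings on output.
printf "lenient decode: %vX\n", $lenient;
print  "FB_CROAK croaked: ", ($croaked ? "yes" : "no"), "\n";
```

The first call is the behaviour I would like to remain the default; the second shows that callers who do want a hard failure can already ask for one explicitly.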