Bjoern Hoehrmann <[EMAIL PROTECTED]> writes:

> * Gisle Aas wrote:
> >As you probably know perl's version of UTF-8 is not the real thing. I
> >thought I would hack up a patch to support the encoding as defined by
> >Unicode. That involves rejecting illegal chars (like surrogates,
> >"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> >and such.
>
> I would very much like to have this functionality available in some
> standard module. Though, what do you mean here by rejecting exactly?

It would do the same as it currently does for illegal UTF-8-Perl. It
falls back to what the CHECK argument asks for.

> For example, by default, I would expect
>
>   decode("UTF-8" => "Bj\xF6rn")
>
> to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
> this change (i.e., would it croak instead)?

It would be exactly the same. More interesting is:

  decode("UTF8", "Bj\xEF\xBF\xBFrn")

where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not legal
Unicode. Either the whole sequence "\xEF\xBF\xBF" is replaced by a single
"\x{FFFD}", or each bad byte is replaced separately, giving us
"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn". I think the latter is more sane,
especially once you hit perl's 64-bit extension to UTF-8.

> >Before I do this I would like to get some feedback on the interface.
> >My preferred interface would be to make:
> >
> >  encode("UTF-8", $string)
> >
> >imply the official restricted form and then have
> >
> >  encode("UTF-8-Perl", $string)
> >
> >be used as the name for Perl's relaxed and extended version of the
> >encoding. The encode_utf8($string) function would continue to be the
> >same as encode("UTF-8-Perl", $string).
>
> I would prefer there was no semantic overloading of "UTF-8" at all,
> I generally expect that anything called UTF-8 refers to UTF-8 as
> defined in the Unicode standard or RFC 3629. I was for example
> surprised that Encode::is_utf8(...) considers sequences UTF-8 that are
> not UTF-8 as defined in those specifications (the documentation
> explicitly states "well-formed UTF-8").

This can be fixed by fixing the documentation.

It might be possible to get away with making a distinction between
'utf8' and 'UTF-8': the former being the perl variant, while we reserve
the uppercase form with a dash for the real UTF-8.

> Now that we have this problem, introducing more places where one needs
> to carefully check the documentation what is considered UTF-8 does not
> seem like the best option, having decode_utf8() and decode(utf8=>...)
> mean something different is likely going to cause confusion. Maybe
> this could go the other way round, i.e. introduce a new encoding
> "UTF-8-Strict" or something.

This is certainly more backwards compatible, but do we really want perl
applications to exchange illegal UTF-8 by default?

> >This implies that encode("UTF-8", $string) can start failing while
> >previously it could not.
>
> As above, by default I do not think it should fail but rather use a
> replacement character instead of croaking.

Yes. By failing I mean: handle the bad bytes as specified by the CHECK
argument.

> The result should be the
> same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)
>
>   encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
>
> where decode("RFC-3629-UTF-8") would always return a RFC-3629-UTF-8
> string with no illegal sequences (and as that should not fail, the
> above should not fail either). I.e.
>
>   encode("RFC-3629-UTF-8" => $string) eq
>     encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
>
> would always hold true (assuming that decode("RFC-3629-UTF-8") would
> ignore that the UTF-8 flag on $string is already set and decode
> "again").
>
> >Other suggestions or comments?
>
> There should be a corresponding is_foo function that checks whether
> a sequence of octets (or a string with the UTF-8 flag set) is actually
> UTF-8 as defined in the relevant specifications, maybe by adding one
> more argument to Encode::is_utf8 like
>
>   Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)

Agree.

Regards,
Gisle