Bjoern Hoehrmann <[EMAIL PROTECTED]> writes:
> * Gisle Aas wrote:
> >As you probably know perl's version of UTF-8 is not the real thing. I
> >thought I would hack up a patch to support the encoding as defined by
> >Unicode. That involves rejecting illegal chars (like surrogates,
> >"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> >and such.
>
> I would very much like to have this functionality available in some
> standard module. Though, what do you mean here by rejecting exactly?
It would do the same as it currently does for illegal UTF-8-Perl. It
falls back to what the CHECK argument asks for.
> For example, by default, I would expect
>
> decode("UTF-8" => "Bj\xF6rn")
>
> to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
> this change (i.e., would it croak instead)?
It would be exactly the same.
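A minimal sketch of what "falls back to what the CHECK argument asks for" means for this input, assuming an Encode that behaves as the documentation quoted above describes. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xF6" is Latin-1 for o-umlaut; on its own it is not valid UTF-8.
my $octets = "Bj\xF6rn";

# No CHECK argument: the bad byte is replaced with U+FFFD.
my $replaced = decode("UTF-8", $octets);

# CHECK = FB_CROAK: decode() dies on the same input.
# A copy is passed because decode() may modify its source when CHECK is set.
my $copy    = $octets;
my $result  = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
my $croaked = defined $result ? 0 : 1;

print "default CHECK: ",
    join(" ", map { sprintf "U+%04X", ord($_) } split //, $replaced), "\n";
print "FB_CROAK died: ", ($croaked ? "yes" : "no"), "\n";
```

The proposed strict encoding would take the same two paths; only the set of byte sequences that count as bad would grow.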
More interesting is:
decode("UTF8", "Bj\xEF\xBF\xBFrn")
where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not
legal Unicode. Either the whole sequence "\xEF\xBF\xBF" is replaced
by "\x{FFFD}" or each bad byte is replaced separately, giving us
"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn". I think the latter will be more sane,
especially when you hit perl's 64-bit extension to UTF-8.
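Which of the two replacement strategies a given Encode implements can be checked with a small probe like the one below. This is only a sketch: no output is claimed here, because the result depends on the installed Encode version and on whether it treats U+FFFF as illegal at all.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xEF\xBF\xBF" is the three-byte encoding of the noncharacter U+FFFF.
my $octets  = "Bj\xEF\xBF\xBFrn";
my $decoded = decode("UTF-8", $octets);

# Count how many U+FFFD replacement characters came back:
# 1 means the whole sequence was replaced, 3 means one per bad byte,
# 0 means the installed Encode accepted the sequence as-is.
my $count = () = $decoded =~ /\x{FFFD}/g;

printf "%d replacement character(s), %d character(s) in total\n",
    $count, length $decoded;
```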
> >Before I do this I would like to get some feedback on the interface.
> >My preferred interface would be to make:
> >
> > encode("UTF-8", $string)
> >
> >imply the official restricted form and then have
> >
> > encode("UTF-8-Perl", $string)
> >
> >be used as the name for Perl's relaxed and extended version of the
> >encoding. The encode_utf8($string) function would continue to be the
> >same as encode("UTF-8-Perl", $string).
>
> I would prefer there was no semantic overloading of "UTF-8" at all,
> I generally expect that anything called UTF-8 refers to UTF-8 as
> defined in the Unicode standard or RFC 3629. I was for example
> surprised that Encode::is_utf8(...) considers sequences UTF-8 that are
> not UTF-8 as defined in those specifications (the documentation
> explicitly states "well-formed UTF-8").
This can be fixed by fixing the documentation. It might be possible
to get away with making a distinction between 'utf8' and 'UTF-8'. The
former would be the perl variant, while we reserve the uppercase form
with a dash for the real UTF-8.
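As a sketch of how a caller could tell the two spellings apart, the lookup below asks Encode which encoding object each name resolves to. It assumes an Encode release in which the two spellings are registered as separate encodings; on one that aliases them, both lookups simply report the same name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the canonical encoding name behind each spelling.
my $lax_name    = find_encoding("utf8")->name;
my $strict_name = find_encoding("UTF-8")->name;

print "utf8  resolves to: $lax_name\n";
print "UTF-8 resolves to: $strict_name\n";
print "distinct encodings: ",
    ($lax_name ne $strict_name ? "yes" : "no"), "\n";
```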
> Now that we have this problem, introducing more places where one needs
> to carefully check the documentation what is considered UTF-8 does not
> seem like the best option; having decode_utf8() and decode(utf8=>...)
> mean something different is likely going to cause confusion. Maybe
> this could go the other way round, i.e. introduce a new encoding
> "UTF-8-Strict" or something.
This is certainly more backwards compatible, but do we really want
perl applications to exchange illegal UTF-8 by default?
> >This implies that encode("UTF-8", $string) can start failing while
> >previously it could not.
>
> As above, by default I do not think it should fail but rather use a
> replacement character instead of croaking.
Yes. By failing I mean: handle the bad bytes as specified by the
CHECK argument.
> The result should be the
> same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)
>
> encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
>
> where decode("RFC-3629-UTF-8") would always return a RFC-3629-UTF-8
> string with no illegal sequences (and as that should not fail, the
> above should not fail either). I.e.
>
> encode("RFC-3629-UTF-8" => $string) eq
> encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
>
> would always hold true (assuming that decode("RFC-3629-UTF-8") would
> ignore that the UTF-8 flag on $string is already set and decode
> "again").
>
> >Other suggestions or comments?
>
> There should be a corresponding is_foo function that checks whether
> a sequence of octets (or a string with the UTF-8 flag set) is actually
> UTF-8 as defined in the relevant specifications, maybe by adding one
> more argument to Encode::is_utf8 like
>
> Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)
Agreed.
Regards,
Gisle