Re: Make Encode.pm support the real UTF-8

Gisle Aas Thu, 02 Dec 2004 07:24:39 -0800

Bjoern Hoehrmann <[EMAIL PROTECTED]> writes:

> * Gisle Aas wrote:
> >As you probably know perl's version of UTF-8 is not the real thing.  I
> >thought I would hack up a patch to support the encoding as defined by
> >Unicode.  That involves rejecting illegal chars (like surrogates,
> >"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> >and such.
> 
> I would very much like to have this functionality available in some
> standard module. Though, what do you mean here by rejecting exactly?


It would do the same as it currently does for illegal UTF-8-Perl.  It
falls back to what the CHECK argument asks for.

> For example, by default, I would expect
> 
>   decode("UTF-8" => "Bj\xF6rn")
> 
> to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
> this change (i.e., would it croak instead)?

It would be exactly the same.
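
To make the CHECK fallback concrete, here is a minimal sketch.  It
assumes an Encode release in which "UTF-8" already names the strict
form; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "Bj\xF6rn";    # Latin-1 bytes, not well-formed UTF-8

# CHECK omitted (default): the malformed byte is replaced by U+FFFD.
my $replaced = decode("UTF-8", $octets);

# CHECK = FB_CROAK: the same input makes decode() die instead.
my $copy    = $octets;      # decode() may modify its source when CHECK is set
my $croaked = eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 } ? 0 : 1;

printf "replaced: %vX, croaked: %d\n", $replaced, $croaked;
```

So "rejecting" never means more than what the caller asked for through
CHECK.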

More interesting is:

   decode("UTF8", "Bj\xEF\xBF\xBFrn")

where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not
legal Unicode.  Either the whole sequence "\xEF\xBF\xBF" is replaced
by a single "\x{FFFD}", or each bad byte is replaced, giving us
"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn".  I think the latter will be more sane,
especially when you hit perl's 64-bit extension to UTF-8.
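
The two candidate results can be written out side by side.  This is
only an illustration of the two policies built by hand, not output
from any decoder:

```perl
use strict;
use warnings;

my $bad = "\xEF\xBF\xBF";    # three octets that would encode U+FFFF

# Policy A: the whole rejected sequence becomes one replacement character.
my $per_sequence = "Bj" . "\x{FFFD}" . "rn";

# Policy B: every rejected octet becomes its own replacement character.
my $per_byte = "Bj" . ("\x{FFFD}" x length($bad)) . "rn";

printf "per sequence: %d chars, per byte: %d chars\n",
    length($per_sequence), length($per_byte);
```

With policy B the decoder never has to decide how long a bad sequence
"really" is, which matters for the very long extended sequences.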

> >Before I do this I would like to get some feedback on the interface.
> >My preferred interface would be to make:
> >
> >   encode("UTF-8", $string)
> >
> >imply the official restricted form and then have
> >
> >   encode("UTF-8-Perl", $string)
> >
> >be used as the name for Perl's relaxed and extended version of the
> >encoding.  The encode_utf8($string) function would continue to be the
> >same as encode("UTF-8-Perl", $string).
> 
> I would prefer there was no semantic overloading of "UTF-8" at all,
> I generally expect that anything called UTF-8 refers to UTF-8 as
> defined in the Unicode standard or RFC 3629. I was for example
> surprised that Encode::is_utf8(...) considers sequences UTF-8 that are
> not UTF-8 as defined in those specifications (the documentation
> explicitly states "well-formed UTF-8").

This can be fixed by fixing the documentation.  It might be possible
to get away with making a distinction between 'utf8' and 'UTF-8', the
former being the perl variant while we reserve the uppercase form with
a dash for the real UTF-8.
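
As a sketch of what such a split looks like from the caller's side:
as far as I know, later Encode releases resolve the two spellings to
different encoding objects, which the snippet below prints.  Treat the
names as an assumption about the installed Encode, not as part of the
proposal above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Uppercase with a dash: the restricted, standards-conforming encoding.
my $strict_name = find_encoding("UTF-8")->name;

# Lowercase without a dash: perl's relaxed and extended variant.
my $lax_name = find_encoding("utf8")->name;

print "UTF-8 => $strict_name, utf8 => $lax_name\n";
```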

> Now that we have this problem, introducing more places where one needs
> to carefully check the documentation for what is considered UTF-8 does
> not seem like the best option; having decode_utf8() and
> decode(utf8=>...) mean something different is likely going to cause
> confusion. Maybe
> this could go the other way round, i.e. introduce a new encoding
> "UTF-8-Strict" or something.

This is certainly more backwards compatible, but do we really want
perl applications to exchange illegal UTF-8 by default?

> >This implies that encode("UTF-8", $string) can start failing while
> >previously it could not.
> 
> As above, by default I do not think it should fail but rather use a
> replacement character instead of croaking.

Yes.  By failing I mean: handle the bad bytes as specified by the
CHECK argument.

>                                              The result should be the
> same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)
> 
>   encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
> 
> where decode("RFC-3629-UTF-8") would always return a RFC-3629-UTF-8
> string with no illegal sequences (and as that should not fail, the
> above should not fail either). I.e.
> 
>   encode("RFC-3629-UTF-8" => $string) eq
>   encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))
> 
> would always hold true (assuming that decode("RFC-3629-UTF-8") would
> ignore that the UTF-8 flag on $string is already set and decode
> "again").
> 
> >Other suggestions or comments?
> 
> There should be a corresponding is_foo function that checks whether
> a sequence of octets (or a string with the UTF-8 flag set) is actually
> UTF-8 as defined in the relevant specifications, maybe by adding one
> more argument to Encode::is_utf8 like
> 
>   Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)

Agree.
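
Until such a check exists, something similar can be approximated on
top of decode().  A minimal sketch, assuming an Encode in which
"UTF-8" is the strict form; is_strict_utf8 is a hypothetical helper
name, not an Encode function:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $octets decodes as strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on the first bad sequence;
    # LEAVE_SRC keeps decode() from modifying $octets.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode("UTF-8", $octets, $check); 1 } ? 1 : 0;
}

print is_strict_utf8("Bj\xC3\xB6rn") ? "ok" : "bad", "\n";    # well-formed
print is_strict_utf8("Bj\xF6rn")     ? "ok" : "bad", "\n";    # lone Latin-1 byte
```

This only covers octet strings; what to do with a string that already
has the UTF-8 flag set is left open here.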

Regards,
Gisle
