Re: Make Encode.pm support the real UTF-8

Tim Bunce Fri, 03 Dec 2004 14:12:42 -0800

On Sat, Dec 04, 2004 at 04:06:46AM +0900, Dan Kogai wrote:
> On Dec 02, 2004, at 23:25, Tim Bunce wrote:
> >On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> >>As you probably know perl's version of UTF-8 is not the real thing.  I
> >>thought I would hack up a patch to support the encoding as defined by
> >>Unicode.  That involves rejecting illegal chars (like surrogates,
> >>"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> >>and such.
> >
> >It's worth remembering that overlong sequences are a potential 
> >security risk.
> >
> >>Before I do this I would like to get some feedback on the interface.
> >>My preferred interface would be to make:
> >>
> >>   encode("UTF-8", $string)
> >>
> >>imply the official restricted form
> >
> >I think that would be best.
> 
> But to what extent?  Does it mean that restricted but unused codepoints 
> (e.g. U+10F000) are to be illegal?  Does that mean we have to verify and,
> if necessary, patch perl any time Unicode.org updates Unicode?
> 
> While I agree official UTF-8 be supported separately from "Perl" UTF-8,


Okay.

> I would like perl to be independent from unicode.org.  Remember that 
> the perl community does not have a vote in unicode.org (or does it?).  
> Making perl too compliant with the Unicode standard means that perl is
> at the mercy thereof.

Whoa. We agree that official UTF-8 should be supported separately from
"Perl" UTF-8. So there must be two names.

Then this thread boils down to what to call them.

> >>This implies that encode("UTF-8", $string) can start failing while
> >>previously it could not.
> >
> >Anyone working with valid UTF-8 would not get failures.
> >Anyone who thinks they're using valid UTF-8 but aren't should be 
> >grateful!
> >Anyone not using valid UTF-8 (eg using it as a way to encode integers)
> >needs to be told in advance - but I doubt there are many and they're
> >likely to be clueful users who read release notes :)
> 
> There are many movements and implementations that "extend" Unicode by 
> making use of codepoints beyond 0x10FFFF.  Current perl can accept 
> them;  "Real", official Unicode cannot.

Sure. I've used perl utf8 for packing large integers myself. That's
not the issue here. The issue is what to call the two encodings.

> >I'd say "UTF-8" should mean the official restricted form for perl 5.10.
> 
> Perl is a language where "use strict" is not default.  Why make its 
> default encoding strict then?  Perl should be liberal, not official.

I didn't actually say that perl's default encoding should be strict,
though I can see how it came across that way.

I'm only saying that the Unicode standard is called "UTF-8" and if
that's what a script explicitly asks for then that's what it should get.

> So my proposal is the opposite;  Leave "utf8" and "UTF-8" as they are
> now and define "UTF-8-official" or "UTF-8-pedantic" or whatever.

Security is for everyone, not just pedants. This is a bit dated, but it
was the best I could find:
http://www.izerv.net/idwg-public/archive/0181.html
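
To make the security point concrete, here is a minimal sketch of the
kind of check the restricted form implies, following the list in Gisle's
proposal (overlong sequences, surrogates, the noncharacters, and
anything above 0x10FFFF). The helper name is_strict_utf8 is hypothetical
and is not part of Encode; it only illustrates the rules on a raw byte
string and is not the patch under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: returns 1 if the byte string
# is well formed under the restricted rules listed above, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $len = length $bytes;
    my $pos = 0;
    while ($pos < $len) {
        my $lead = ord substr($bytes, $pos, 1);
        my ($need, $cp, $min);
        if    ($lead < 0x80)           { $pos++; next }    # plain ASCII
        elsif (($lead & 0xE0) == 0xC0) { ($need, $cp, $min) = (1, $lead & 0x1F, 0x80) }
        elsif (($lead & 0xF0) == 0xE0) { ($need, $cp, $min) = (2, $lead & 0x0F, 0x800) }
        elsif (($lead & 0xF8) == 0xF0) { ($need, $cp, $min) = (3, $lead & 0x07, 0x10000) }
        else                           { return 0 }        # stray continuation or 5/6-byte lead
        return 0 if $pos + $need >= $len;                  # truncated sequence
        for my $i (1 .. $need) {
            my $cont = ord substr($bytes, $pos + $i, 1);
            return 0 unless ($cont & 0xC0) == 0x80;        # must be 10xxxxxx
            $cp = ($cp << 6) | ($cont & 0x3F);
        }
        return 0 if $cp < $min;                            # overlong form
        return 0 if $cp > 0x10FFFF;                        # beyond Unicode
        return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;        # surrogate
        return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;        # noncharacter block
        return 0 if ($cp & 0xFFFE) == 0xFFFE;              # U+xxFFFE / U+xxFFFF
        $pos += $need + 1;
    }
    return 1;
}

# "\xC0\xAF" is an overlong two-byte spelling of "/" (0x2F).
print is_strict_utf8("/")        ? "accepted" : "rejected", "\n";
print is_strict_utf8("\xC0\xAF") ? "accepted" : "rejected", "\n";
```

The overlong case is the one that matters for security: a filter that
looks for the byte 0x2F never sees it, but a lax decoder downstream
still turns "\xC0\xAF" into "/".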

> >The only remaining issues are then what to do for 5.8.7
> >and what to call the unrestricted encoding.
> 
> I would like to keep calling that 'utf8'.

I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
but "UTF-8" is the name of the standard and should give the
corresponding behaviour.
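
As an illustration of that naming split, the sketch below assumes an
Encode release in which "UTF-8" resolves to a strict encoding and "utf8"
to perl's lax one. That is how the Encode documentation describes later
releases; it is an assumption here, not something settled in this
thread, and an Encode without the split will accept both calls.

```perl
use strict;
use warnings;
use Encode ();

# chr(0x110000) is above the Unicode range but fine as a perl character.
my $beyond = chr(0x110000);

# Lax name: perl's own extended form accepts the value.
my $lax = Encode::encode("utf8", $beyond);
printf "utf8:  %s\n", join " ", map { sprintf "%02X", ord } split //, $lax;

# Strict name: with FB_CROAK the same value is refused (assumed behaviour).
my $copy   = $beyond;    # FB_CROAK may modify its source argument
my $strict = eval { Encode::encode("UTF-8", $copy, Encode::FB_CROAK()) };
print "UTF-8: ", (defined $strict ? "accepted" : "rejected"), "\n";
```

Scripts that pack large integers keep working by asking for 'utf8',
while anything that explicitly asks for "UTF-8" gets the standard's
behaviour.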

Tim.
