On Sat, Dec 04, 2004 at 04:06:46AM +0900, Dan Kogai wrote:
> On Dec 02, 2004, at 23:25, Tim Bunce wrote:
> >On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> >>As you probably know perl's version of UTF-8 is not the real thing. I
> >>thought I would hack up a patch to support the encoding as defined by
> >>Unicode. That involves rejecting illegal chars (like surrogates,
> >>"\x{FFFF}" and "\x{FDD0}), chars above 0x10FFFF, overlong sequences
> >>and such.
> >
> >It's worth remembering that overlong sequences are a potential
> >security risk.
> >
> >>Before I do this I would like to get some feedback on the interface.
> >>My prefered interface would be to make:
> >>
> >>   encode("UTF-8", $string)
> >>
> >>imply the official restricted form
> >
> >I think that would be best.
>
> But to what extent? Does it mean restricted, but unused codepoints
> (i.e. U+10F000) to be illegal? Does that mean we have to verify and if
> necessary, patch perl anytime Unicode.org updates Unicode?
>
> While I agree official UTF-8 be supported separately from "Perl" UTF-8,

Okay.

> I would like perl to be independent from unicode.org. Remember that
> perl community does not have a vote in unicode.org (or does it?).
> Making perl too compliant to the Unicode standard means that perl is at
> a mercy thereof.

Whoa. We agree official UTF-8 be supported separately from "Perl" UTF-8.
So there must be two names. Then this thread boils down to what to call them.

> >>This implies that encode("UTF-8", $string) can start failing while
> >>previously it could not.
> >
> >Anyone working with valid UTF-8 would not get failures.
> >Anyone who thinks they're using valid UTF-8 but aren't should be
> >grateful!
> >Anyone not using valid UTF-8 (eg using it as a way to encode integers)
> >needs to be told in advance - but I doubt there are many and they're
> >likely to be cluefull users who read release notes :)
>
> There are many movements and implementations that "extends" Unicode by
> making use of codepoints beyond 0x10FFFF. Current perl can accept
> them; "Real", official unicode cannot.

Sure. I've used perl utf8 for packing large integers myself.
That's not the issue here. The issue is what to call the two encodings.

> >I'd say "UTF-8" should mean the official restricted form for perl 5.10.
>
> Perl is a language where "use strict" is not default. Why make its
> default encoding strict then? Perl should be liberal, not official.

I didn't actually say that perl's default encoding should be strict,
though I can see how it came across that way. I'm only saying that
the Unicode standard is called "UTF-8" and if that's what a script
explicitly asks for then that's what it should get.

> So my proposal is opposite; Leave "utf8" and "UTF-8" as it is now and
> define "UTF-8-official" or "UTF-8-pedantic" or whatever.

Security is for everyone, not just pedants.
This is a bit dated but was the best I could find
http://www.izerv.net/idwg-public/archive/0181.html

> >The only remaining issues are then what to do for 5.8.7
> >and what to call the unrestricted encoding.
>
> I would like to keep calling that 'utf8'.

I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
but "UTF-8" is the name of the standard and should give the
corresponding behaviour.

Tim.
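
For anyone following the thread who wants to see the overlong-sequence
risk concretely, here is a minimal sketch. It is written in Python
rather than perl, purely for illustration, and it is not perl's
decoder: naive_decode() is a hypothetical lax decoder invented for this
example, and strict_accepts() simply wraps Python's built-in strict
"utf-8" codec. The point is only that a lax decoder turns the bytes
C0 AF into "/", so a byte-level check for "/" made before decoding
can be bypassed, while a strict decoder refuses the input.

```python
# Illustration only. naive_decode is a hypothetical, deliberately lax
# decoder; it does NOT reproduce perl's "utf8" behaviour.

def naive_decode(data: bytes) -> str:
    """Decode UTF-8-shaped bytes with no overlong or surrogate checks."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Pick the payload bits and the number of continuation bytes
        # from the lead byte alone.
        if b < 0x80:
            cp, extra = b, 0
        elif b >> 5 == 0b110:
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any(t >> 6 != 0b10 for t in tail):
            raise ValueError(f"bad continuation at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        # No check that cp needed this many bytes (overlong forms pass)
        # and no check for surrogates.
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


def strict_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    raw = b"..\xc0\xafetc"           # C0 AF is an overlong form of "/"
    print(b"/" in raw)               # byte-level filter sees no slash
    print(repr(naive_decode(raw)))   # the lax decoder produces one anyway
    print(strict_accepts(raw))       # the strict decoder rejects the input
    print(strict_accepts(b"\xed\xa0\x80"))  # encoded surrogate U+D800
```

Whichever names the two encodings end up with, this is the difference
a caller asking for the standard form would be relying on.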