At 12:39 PM 2001-08-12 -0500, Jarkko Hietaniemi wrote:
>[...]
>If people feel this way, okay, I can live with it (having UTF-8
>and Unicode being the pod defaults).
>
>But shouldn't we then also go for UTF-8 being the default in scripts?
>Oh well, I guess that would be too radical since that would blow scripts
>having literal 8-bit data (even Latin-1) in them out of the water.
>[...]

I hasten to note that having highbit literals in POD has so far been
/quite/ rare, so that my declaring utf8 to be the default breaks almost
nothing anywhere -- whereas declaring utf8 the default for program code has
potential to break rather more.  How that actually is, I don't know.

How feasible is it to say that if the first sequence of highbit characters
in a Perl source file is a valid UTF8 sequence, perl should presume it's a
UTF8 file?  That seems to go toward implementing UTF DWIM-ness that I
d(w)imly remember Larry going on about at TPC.  Whether this wreaks havoc
with toke.c, I don't know.  I dimly remember it being an explicit goal of
UTF8 that it uses byte sequences that almost never naturally occur (like
"Ã©" for "é"), so that if you see a valid UTF8 sequence, it's probably
UTF8.  I wouldn't want to stake the world on that /always/ working, but I
think that using that to guess at an encoding would probably do the right
thing in 99% of cases.
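
To make the proposed heuristic concrete, here is a minimal sketch in Python
(not Perl, and not anything toke.c actually does). The function name and the
"ascii" / "utf8" / "8bit" labels are invented for illustration. It looks only
at the first run of highbit bytes, as suggested above, and asks whether that
run is well-formed UTF8.

```python
import re

# Hypothetical helper for illustration only; not part of perl or toke.c.
HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")

def guess_source_encoding(data: bytes) -> str:
    """Guess an encoding from the first run of highbit bytes in `data`.

    Returns "ascii" if there are no highbit bytes at all, "utf8" if the
    first highbit run is well-formed UTF-8, and "8bit" otherwise (some
    legacy single-byte encoding such as Latin-1).
    """
    match = HIGHBIT_RUN.search(data)
    if match is None:
        return "ascii"
    try:
        # Strict decoding rejects truncated, overlong, and stray bytes.
        match.group(0).decode("utf-8")
    except UnicodeDecodeError:
        return "8bit"
    return "utf8"
```

A lone Latin-1 "é" (byte 0xE9) followed by ASCII is not valid UTF8, so it is
classed as "8bit"; the two-byte sequence 0xC3 0xA9 is classed as "utf8". The
sketch guesses wrong exactly where a Latin-1 file really does contain the
literal characters "Ã©", which is the rare case conceded above.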

(Whether you might want to make that an option for discipline-undeclared
handles generally, is another question.  I'm quite out of the loop of
handle disciplines, so sorry if this is an old idea.)

[and in another message in the same thread]
>[...]
>If you want to count on something I hope you can be bothered to write
>E<ecircum> instead of E<234>.
>[...]

True, but mnemonics exist for only a tiny part of the codespace, and I was
bothered by that.  Making people stick to mnemonics while making utf8 the
POD default introduces a weird asymmetry, in that it would mean that
there's things (anything without a E<mnemonic>) that you can express with
literals but not with E<...>'s.
Or deprecating/nativizing E<num> for num < 256 would introduce a different
asymmetry: some of E<num> would be native while E<biggernums> would be in
Unicode.

(BTW, it's "ecirc", not "ecircum".  That's one problem -- sometimes
mnemonics aren't.  But that's not an argument for E<number>, it's an
argument for allowing highbit literals and saying everyone has to (by
default) understand the literals the same way -- i.e., as utf8.)
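
The coverage gap between mnemonics and numbers can be illustrated with a
small Python sketch. It makes two assumptions: Python's HTML entity table
(html.entities.name2codepoint) stands in for the set of POD mnemonics, and
only decimal E<number> forms are handled. The function name is hypothetical.

```python
from html.entities import name2codepoint

def resolve_pod_escape(body: str) -> str:
    """Resolve the text inside E<...> to a single character.

    Decimal numbers are read as Unicode code points; names are looked up
    in the HTML entity table, used here as a stand-in for POD mnemonics.
    """
    if body.isascii() and body.isdigit():
        return chr(int(body))
    if body in name2codepoint:
        return chr(name2codepoint[body])
    raise KeyError(f"unknown escape E<{body}>")
```

With this reading, E<ecirc> and E<234> both give "ê", E<ecircum> is an
error, and a code point with no mnemonic (say E<9731>, U+2603) is reachable
only by number or as a literal.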


BTW, as to the effects of making Unicode into POD's reference character set
for even EBCDIC people, I claim only blissful ignorance of the details of
Perl concepts of characters and ASCII and UTF8 under EBCDIC.  (That's the
main reason I just avoided any mention of EBCDIC in the spec.)
But I suppose I can merely summarize my position as "I'm happy as long as
you make E<64> synonymous with '@', and so on for hopefully all of USASCII
32-126, and hopefully also try doing good with at least the high-traffic
0xA0-0xFF area too."
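
For readers who have not met the problem, this Python sketch shows why the
E<64> point matters. It assumes cp037 as a representative EBCDIC code page
(other EBCDIC code pages differ in places), and the two function names are
made up for the example.

```python
def e_num_as_unicode(n: int) -> str:
    """Read E<n> as a Unicode code point, the position taken above."""
    return chr(n)

def e_num_as_native(n: int, codec: str = "cp037") -> str:
    """Read E<n> as a byte value in a native code page.

    cp037 is an assumed example of an EBCDIC code page.
    """
    return bytes([n]).decode(codec)
```

Read as Unicode, E<64> is "@" everywhere. Read as a native cp037 byte, 64
(0x40) is a space, and "@" lives at 0x7C instead, so the same POD would
render differently on the two kinds of platform.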


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
