On Mon, 13 Aug 2001, Philip Newton wrote:

> On 12 Aug 01, at 18:42, Prymmer/Kahn wrote:
> 
> > On Sun, 12 Aug 2001, Jarkko Hietaniemi wrote:
> > 
> > > Summary: I suggest deprecating E<nn> where nn < 256 since it is not portable.
> > 
> > I think that strategy might be too drastic.  Why deprecate?  Why not
> > simply warn about the unportability but still allow the flexibility
> > afforded by numeric character specification?
> 
> What's flexible about something where you have no idea what will come 
> out? On the other hand, it's as flexible as putting in raw bytes 
> without specifying the character set -- it'll look OK if the receiver 
> has the same coded character set as the sender, but otherwise possibly 
> not.

If I am on a platform where ê is encoded at codepoint 82 (rather than 
at 234) I could follow a pod spec that told me that "E<number> gives
you the character at codepoint number in your local coded character
set" and I could write my pod as:

   The e with circumflex (E<82>) is at codepoint 82 in IBM-1047

and not expect that to work on an ASCII-based machine. The standard could
then be said to be flexible in that it allows for the differences in coded
character sets used across various computer platforms.  An inflexible pod
spec would have to dictate what integer was to be used for every glyph.
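
To make that concrete, here is a minimal Python sketch (not taken from any
pod tool) of what "the character at codepoint number in your local coded
character set" would imply.  Python's standard library has no IBM-1047
codec as far as I know, so cp037 is used as an EBCDIC stand-in; that is an
assumption, although both code pages place ê at 82.  The helper name
local_char is hypothetical.

```python
def local_char(number: int, charset: str) -> str:
    """Return the character that one byte value denotes in `charset`."""
    return bytes([number]).decode(charset)

# The same glyph needs a different number on each platform ...
print(local_char(234, "latin-1"))  # ê under ISO 8859-1
print(local_char(82, "cp037"))     # ê under EBCDIC (stand-in for IBM-1047)

# ... and the same number names a different glyph on each platform.
print(local_char(82, "latin-1"))   # R
```

Read this way, E<82> is only meaningful together with the name of the coded
character set it was written for.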

> > Specifying numeric codepoints may prove to be a popular thing given
> > the rather sorry state of input methods among common text editors. 
> 
> If non-ASCII characters are replaced by E<234> automatically (say, by a 
> script run over the finished text), then they could (nearly) as easily 
> be replaced by E<ecirc> references; and if they are entered by hand, 
> then IMO the mnemonic E<ecirc> is easier to remember than E<234>. (I 
> doubt that text editors will produce such things themselves on a 
> keypress of ê due to the "rather sorry state of input methods" you 
> mentioned.) I'm not sure what's gained by allowing E<234> if you don't 
> also mandate "this means code point 234 in the character set X" -- 
> regardless of whether "X" eq "Unicode" or "EBCDIC" or "Latin-9" or 
> whatever.
> 
> I think I agree with Jarkko that E<nnn> for nnn < 256 should either be 
> deprecated (yes, even for nnn < 127) or be specified as being in a 
> specific character set (for example, Unicode, for compatibility with 
> E<nnn> for nnn > 255).

I think that I agree with both you and Jarkko there.  Saying that E<234>
means ê would have to be qualified with "that is true at least for ISO
8859-1 coded character sets" or somesuch.  However, imposing the rule that
E<234> has to mean ê on all platforms is simply ridiculous: perl does 
not have that great a control over the fonts used in a given user
display application (xterm, MPW shell SIOW window, MS-DOS DVM, etc.).
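
As a rough illustration of the "qualified with a character set" idea, the
following Python sketch resolves the inside of an E<...> escape.  It is not
how any existing pod converter works: the function name resolve_escape and
the numeric_charset parameter are hypothetical, HTML 4 entity names stand in
for the pod names, and cp037 again stands in for IBM-1047.

```python
from html.entities import name2codepoint

def resolve_escape(body: str, numeric_charset: str = "latin-1") -> str:
    """Resolve the inside of E<...> to a single character.

    Named escapes such as "ecirc" do not depend on any charset.
    Numbers below 256 are read as one byte in `numeric_charset`
    (an assumed rule); larger numbers are taken as Unicode code points.
    """
    if body.isdigit():
        number = int(body)
        if number < 256:
            return bytes([number]).decode(numeric_charset)
        return chr(number)
    return chr(name2codepoint[body])

print(resolve_escape("ecirc"))        # ê regardless of platform
print(resolve_escape("234"))          # ê only because latin-1 is assumed
print(resolve_escape("82", "cp037"))  # ê when an EBCDIC charset is stated
```

The named form gives the same answer everywhere; the numeric form only does
so once the character set is part of the statement.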

> I would also suggest that raw bytes be interpreted as UTF-8 in the 
> absence of other indications of encoding (such as UTF-16 BOMs); this 
> would automatically mean that text written in ASCII environments would 
> be interpreted correctly, since the byte representation for the subset 
> of Unicode corresponding to ASCII is identical between ASCII and UTF-8.

Part of the instructions for unpacking a perl source tar ball on, say, z/OS
is to use pax like so:

  pax -o to=IBM-1047,from=ISO8859-1 -r < perl$n.tar

which untars the perl$n.tar ball and translates the contents of all files
from the ASCII encoding to the local EBCDIC encoding.  In other words, every
byte with value 65 in a pod file is translated to 193 on z/OS.  That way the
letter 'A' in a pod file written on an ASCII machine still appears as the
letter 'A' on an EBCDIC machine, and the pod source can be read using
'less', 'more', 'vi' or 'emacs' on z/OS.  (Recall that the p in pod stands
for "plain".)  Imposing the rule that everything in a pod file has to be
assumed to be in the UTF-8 encoding means, unfortunately, that pod
converters such as pod2man ought not to be used at all on EBCDIC platforms.
One certainly would not want to use 'more' to read through pod/perlpod.pod
if it had to be converted back to what looks like a bunch of ASCII garbage
on z/OS.
(BTW the suggestion of leaving all perl source in ASCII encoded form won't
work either, since the C compiler would not understand source code
that is not in the IBM-1047 encoding.)
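
For anyone without an EBCDIC machine at hand, the effect can be approximated
in Python.  This is only a sketch of the kind of byte translation described
above, not of what pax itself does internally; cp037 is again an assumed
stand-in for IBM-1047, and to_ebcdic is a hypothetical helper name.

```python
def to_ebcdic(data: bytes) -> bytes:
    """Transcode ISO 8859-1 bytes to EBCDIC (cp037 as a stand-in)."""
    return data.decode("latin-1").encode("cp037")

# The letter 'A' moves from 65 to 193.
print(to_ebcdic(b"A")[0])  # 193

# A translated pod line is no longer acceptable input for a UTF-8 decoder.
pod_ebcdic = to_ebcdic(b"=head1 NAME")
try:
    pod_ebcdic.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False
print(readable_as_utf8)  # False
```

So a converter that insists on UTF-8 would reject, or misread, the very files
that the local 'more' and 'vi' display correctly.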

Peter Prymmer

