On 01/12/2015 01:27 PM, Karl Williamson wrote:
On 01/12/2015 12:49 PM, David E. Wheeler wrote:
On Jan 12, 2015, at 11:46 AM, Karl Williamson
<pub...@khwilliamson.com> wrote:

I ran across this link, but didn't see what action was taken on it:
http://www.w3.org/TR/newline

Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be
a mistake on EBCDIC?

David


Yes, that's essentially what I meant when I said in an earlier email
that NEL is THE new-line character on os390, which generally runs using
EBCDIC.  The code point for NEL in cp1252 is a horizontal ellipsis, and
not a "next line", but on some platforms, like os390, it means "next
line".   This is a conflict.

However, now that I think about it, when I look at os390 runs, I rarely
see NELs.  Maybe there is a filter that translates them to \n before the
pod sees it, but sometimes, I do see NEL all over the place but no \n.
I'll ask on the perl-mvs list about this.


tl;dr: I was wrong to think there was a problem in s/latin1/cp1252/ for EBCDIC.

In researching the issue in order to create an intelligent posting, I found the answer.

It is an undocumented subtlety of Perl's EBCDIC implementation, one that I was surprised I didn't know, as I've been pretty deep into that implementation.

And it's interesting (at least to me), so I'll document it here (as well as make corrections to perlebcdic.pod).

As many of you know, ASCII has both CR and LF characters that are used variously as line termination characters. Old Apple systems used CR, and Windows uses the combination CR-LF. Perl handled the Apple issue by swapping the meanings of \r and \n there; it handles CR-LF by having an I/O layer that makes CR-LF appear as a single \n internally, so the gotchas are hidden from most applications.
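As a loose analogy only (this is Python, not Perl, and says nothing about how Perl's layer is implemented), the effect of a translating I/O layer can be sketched like this. Python's `io.TextIOWrapper` with `newline=None` folds CR-LF into a single \n on read; the sample bytes are made up for illustration.

```python
import io

# Raw bytes as they might sit in a Windows-style file: CR-LF line endings.
raw = io.BytesIO(b"one\r\ntwo\r\n")

# newline=None enables newline translation on read, so the
# program only ever sees "\n" as the line terminator.
text = io.TextIOWrapper(raw, encoding="ascii", newline=None).read()

print(repr(text))
```

The application code after the read never has to know which convention the file used.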

In addition, Unicode defines the NEL (next line) character, which is another alternative line terminator. Its code point, U+0085, is the one that CP1252 uses instead to mean a horizontal ellipsis.
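To make the conflict concrete, here is a small sketch (in Python, purely for illustration) that decodes the same byte, 0x85, under Latin-1 and under CP1252. Under Latin-1 it is the NEL control character; under CP1252 it is the ellipsis.

```python
import unicodedata

byte = b"\x85"

as_latin1 = byte.decode("latin-1")  # U+0085, the NEL control character
as_cp1252 = byte.decode("cp1252")   # U+2026, a printable ellipsis

print(hex(ord(as_latin1)))
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))
```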

It turns out that NEL is the character that os390 uses as its line terminator, rather than CR or LF. It is called NL in EBCDIC. (NL is unfortunately a synonym for LF in ASCII and Unicode terminology.)

What Perl does to handle this is to simply swap the NEL and LF code points. That makes \n mean NEL instead of LF. Apparently LF is unused in EBCDIC applications, so it works. There is official support for this swap, as Unicode's definition of how to get UTF-8 to work on EBCDIC platforms says to do the swap.
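The idea of the swap can be sketched as follows. This is an illustration in Python, not Perl's actual implementation: the helper name `swap_nel_lf` is hypothetical, the swap is applied after decoding rather than inside the code-point mapping, and cp037 is used only because it is an EBCDIC code page that Python ships (its byte 0x15 decodes to U+0085).

```python
# Swap table: NEL (U+0085) <-> LF (U+000A). Every other character is untouched.
SWAP_NEL_LF = {0x85: 0x0A, 0x0A: 0x85}


def swap_nel_lf(text: str) -> str:
    """Exchange NEL and LF in an already-decoded string (hypothetical helper)."""
    return text.translate(SWAP_NEL_LF)


# An EBCDIC line: "hello" followed by the EBCDIC NL byte (0x15 in cp037).
ebcdic_line = "hello".encode("cp037") + b"\x15"

decoded = ebcdic_line.decode("cp037")  # ends in U+0085 (NEL)
swapped = swap_nel_lf(decoded)         # now ends in "\n"

print(repr(decoded), repr(swapped))
```

Because the table is symmetric, applying the swap twice gives back the original string.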

It does mean that NL doesn't denote the character that a native EBCDIC speaker would expect.

But the bottom line is that because of this character swapping, the NEL characters in EBCDIC appear as \n, so they aren't a problem for CP1252.
