Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-13 Thread David E. Wheeler
On Jan 12, 2015, at 11:42 AM, David E. Wheeler da...@justatheory.com wrote:

 Honestly, since the current regex matches stuff that is not in fact Pod, I 
 think it is reasonable to tighten up the regex to
 
/\A=([a-zA-Z]+[0-9]*)\b/

That one, it turns out, was no less liberal than the previous regex. I added a 
test matching the pattern Randy identified, and it failed with this regex, too. 
So I instead copied the regex from later in the file, which *is* sufficiently 
more strict, and brings them into line, to boot. The change is here:

  https://github.com/theory/pod-simple/commit/31942ec
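
To make the difference concrete, here is a minimal Python sketch. It is not the code from the commit. LOOSE is the regex quoted above, read with `[0-9]`. STRICT is a hypothetical tighter pattern invented for illustration, requiring whitespace or end of input after the command name. The helper name and the sample inputs are made up as well.

```python
import re

# The proposed pattern from the message above (reading "[0=9]" as "[0-9]").
LOOSE = re.compile(r"\A=([a-zA-Z]+[0-9]*)\b")

# Hypothetical stricter pattern, NOT the one in the commit: the command
# name must be followed by whitespace or by the end of the input.
STRICT = re.compile(r"\A=([a-zA-Z][a-zA-Z0-9]*)(?:\s|\Z)")


def looks_like_pod(line, pattern):
    """Return True if the line starts with something the pattern accepts."""
    return pattern.match(line) is not None


# Two genuine Pod commands and one binary-looking line (made-up samples).
SAMPLES = ["=head1 NAME", "=cut\n", "=ab\x00\x01\x02"]

for line in SAMPLES:
    print(repr(line), looks_like_pod(line, LOOSE), looks_like_pod(line, STRICT))
```

A `\b` is satisfied between a letter and any non-word character, including a NUL, which is why the loose pattern accepts the third sample while the stricter sketch rejects it.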

Look good? If so, I will update perlpodspec to match it and send it off to p5p.

Best,

David





Re: Assume CP1252

2015-01-13 Thread Karl Williamson

On 01/12/2015 01:27 PM, Karl Williamson wrote:

On 01/12/2015 12:49 PM, David E. Wheeler wrote:

On Jan 12, 2015, at 11:46 AM, Karl Williamson
pub...@khwilliamson.com wrote:


I ran across this link, but didn't see what action was taken on it:
http://www.w3.org/TR/newline


Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be
a mistake on EBCDIC?

David



Yes, that's essentially what I meant when I said in an earlier email
that NEL is THE new-line character on os390, which generally runs using
EBCDIC.  The code point for NEL in cp1252 is a horizontal ellipsis, and
not a next line, but on some platforms, like os390, it means next
line.   This is a conflict.

However, now that I think about it, when I look at os390 runs, I rarely
see NELs.  Maybe there is a filter that translates them to \n before the
pod sees it, but sometimes, I do see NEL all over the place but no \n.
I'll ask on the perl-mvs list about this.



tl;dr:  I was wrong to think there was a problem in s/latin1/cp1252/ for 
EBCDIC.


In researching the issue in order to create an intelligent posting, I 
found the answer.


It is an undocumented subtlety with Perl's EBCDIC implementation, that I 
was surprised I didn't know, as I've been pretty deep into that 
implementation.


And it's interesting (at least to me), so I'll document it here (as well 
as make corrections to perlebcdic.pod).


As many of you know, ASCII has both CR and LF characters that are used 
variously as line termination characters.  Old Apple used CR, and 
Windows uses the combination CR-LF.  Perl handled the Apple issue by 
swapping the meanings of \r and \n there; it handles CR-LF by having an 
I/O layer that makes CR-LF appear as a single \n internally so the 
gotchas are hidden from most applications.
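
As a rough analogy only, the same kind of translating layer can be sketched in Python. This is Python's text I/O layer, not Perl's, and the sample bytes are made up. With `newline=None`, the reader turns CR-LF (and a bare CR) into a single "\n" before the application sees the data.

```python
import io

# Made-up input with Windows-style CR-LF line endings.
raw = b"one\r\ntwo\r\n"

# newline=None enables universal-newline translation on read:
# "\r\n" and "\r" are both delivered to the caller as "\n".
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
text = reader.read()

print(repr(text))
```

The application code after the read only ever deals with "\n", which is the point of hiding the translation in a layer.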


In addition, Unicode defines the NEL (next line) character, which is 
another alternative line terminator.  Its code point is the one that 
CP1252 uses instead to mean a horizontal ellipsis.
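
The clash can be seen with a short Python sketch, using Python's stock `latin-1` and `cp1252` codecs and made-up sample bytes. The same byte, 0x85, decodes to the NEL control character under Latin-1 but to a horizontal ellipsis under CP1252, and only the former is treated as a line boundary.

```python
import unicodedata

NEL_BYTE = b"\x85"

# Latin-1 maps every byte to the code point of the same value: U+0085 (NEL).
as_latin1 = NEL_BYTE.decode("latin-1")

# CP1252 assigns the same byte to U+2026 instead.
as_cp1252 = NEL_BYTE.decode("cp1252")

print(hex(ord(as_latin1)))
print(unicodedata.name(as_cp1252))

# str.splitlines() treats U+0085 as a line boundary, but not U+2026.
lines_latin1 = b"one\x85two".decode("latin-1").splitlines()
lines_cp1252 = b"one\x85two".decode("cp1252").splitlines()
print(lines_latin1, lines_cp1252)
```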


It turns out that NEL is the character that os390 uses as its line 
terminator, not CR nor LF.  It is called NL in EBCDIC.  (NL is 
unfortunately a synonym for LF in ASCII and Unicode terminology.)


What Perl does to handle this is to simply swap the NEL and LF code 
points.  That makes \n mean NEL instead of LF.  Apparently LF is unused 
in EBCDIC applications, so it works.  There is official support for this 
swap, as Unicode's definition of how to get UTF-8 to work on EBCDIC 
platforms says to do the swap.
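
The effect of the swap can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Perl implements it internally. It assumes Python's stock `cp037` codec as a stand-in EBCDIC code page (the code page on a real os390 system may differ), and the function name and sample data are hypothetical.

```python
# Exchange NEL (U+0085) and LF (U+000A) after decoding.
SWAP_NEL_LF = str.maketrans({"\x85": "\n", "\n": "\x85"})


def decode_ebcdic_with_swap(raw, codec="cp037"):
    """Decode EBCDIC bytes, then swap NEL and LF (illustrative helper)."""
    # cp037 is an assumption: a readily available EBCDIC codec, used here
    # only to get bytes where the line terminator is NL (0x15).
    return raw.decode(codec).translate(SWAP_NEL_LF)


# Made-up EBCDIC data whose lines end in NL (byte 0x15).
raw = "one".encode("cp037") + b"\x15" + "two".encode("cp037") + b"\x15"

plain = raw.decode("cp037")            # line ends arrive as U+0085
swapped = decode_ebcdic_with_swap(raw)  # line ends arrive as "\n"

print(repr(plain))
print(repr(swapped))
```

After the swap the text contains no U+0085 at all, so there is nothing left for a CP1252 reading to mistake for an ellipsis.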


It does mean that NL doesn't mean the character that a native EBCDIC 
speaker would think.


But the bottom line is that because of this character swapping, the NEL 
characters in EBCDIC appear as \n, so aren't a problem for CP1252.


Re: Assume CP1252

2015-01-13 Thread David E. Wheeler
On Jan 13, 2015, at 10:31 AM, Karl Williamson pub...@khwilliamson.com wrote:

 What Perl does to handle this is to simply swap the NEL and LF code points.  
 That makes \n mean NEL instead of LF.  Apparently LF is unused in EBCDIC 
 applications, so it works.  There is official support for this swap, as 
 Unicode's definition of how to get UTF-8 to work on EBCDIC platforms says to 
 do the swap.

Huh. Good to know (and have it documented now!).

 It does mean that NL doesn't mean the character that a native EBCDIC speaker 
 would think.
 
 But the bottom line is that because of this character swapping, the NEL 
 characters in EBCDIC appear as \n, so aren't a problem for CP1252.

Nice. So should we then adopt the same pattern as the HTML 5 spec?

And I wonder if that W3 spec issue you pointed to the other day could use a 
comment to this effect.

Best,

David


