Re: Assume CP1252

2015-01-10 Thread Sean Burke

Helleu, Pod pals!
Short version about Re: Assume CP1252-- I advise: yes, assume CP1252 
where technically you were expecting Latin-1.


 ~~

Long version:

I don't normally pipe up about (or keep up with anything about) Pod 
stuff, because it's yall's language now-- but since an issue of my 
original intent has come up, and it shunted into my normal inbox, I'll 
jump in:


On 01/05/2015 10:58 PM, David E. Wheeler wrote:


[...] Pod Peeps:
 if the first highbit byte sequence in the file seems valid as a UTF-8
 sequence, or otherwise as Latin-1.
[...]  I suggest we switch from Latin-1 to CP1252. [...]
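
The detection rule quoted above can be sketched roughly as follows. This is a minimal illustration in Python, not Pod::Simple's actual code; the function name guess_encoding and the returned labels are assumptions made for the example.

```python
def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess the encoding of undeclared input from its first high-bit byte.

    Hypothetical helper: if the first byte >= 0x80 starts a valid UTF-8
    sequence, call the input UTF-8; otherwise use the fallback.
    """
    # Find the first byte with the high bit set.
    start = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if start is None:
        # Pure ASCII decodes identically either way.
        return fallback

    lead = data[start]
    if 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        # Not a legal UTF-8 lead byte (for example a bare 0x93).
        return fallback

    try:
        # The strict decoder rejects truncated or malformed sequences.
        data[start:start + length].decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Only the first high-bit sequence is examined, as in the quoted rule, so a file that starts out looking like UTF-8 and later breaks would still be labelled UTF-8 by this sketch.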


I agree completely, go for it!

Yes:
* assume that input is CP1252 in the absence of any encoding being 
declared

* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing 
Latin-1 to actually mean the superset CP1252) means in practice that 
everybody wins, and nobody loses, and DWIM prevails yet again.
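
To make the "superset" point concrete, here is a small Python sketch (the sample bytes are made up for illustration). Bytes 0x80-0x9F are where the two encodings differ: Latin-1 maps them to invisible C1 control characters, while CP1252 maps most of them to printable punctuation. One caveat: Python's cp1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned and raises an error on them, so a tolerant reader would need its own handling for those.

```python
# Sample bytes: curly quotes (0x93, 0x94) plus an e-acute (0xE9).
raw = b"\x93DWIM\x94 caf\xe9"

as_latin1 = raw.decode("latin-1")   # 0x93/0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")    # 0x93/0x94 become U+201C / U+201D

# Outside 0x80-0x9F the two codecs agree byte for byte.
same_upper_range = all(
    bytes([b]).decode("latin-1") == bytes([b]).decode("cp1252")
    for b in range(0xA0, 0x100)
)

# Bytes in 0x80-0x9F that Python's cp1252 codec refuses to decode.
unassigned = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(b)
```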


Moreover, this construal of Latin-1 as CP1252 has significant precedent:

«Most modern web browsers and e-mail clients treat the MIME charset 
ISO-8859-1 as Windows-1252 to accommodate such mislabeling.  This is 
now standard behavior in the draft HTML 5 specification, which 
requires that documents advertised as ISO-8859-1 actually be parsed 
with the Windows-1252 encoding.»


And it obeys Postel's law:
Be conservative in what you do; be liberal in what you accept from 
others.


And...
  http://www.w3.org/TR/encoding/#names-and-labels
even seems to tolerate more things, to a point, if I'm reading it 
right.  Dunno.  On this point, it's up to you folks.



BTW: I think many people would appreciate having "=encoding ansi" 
tolerated as a synonym for "=encoding win-1252"... because some 
systems simply call it that-- and I can never remember 1252 vs 1250 vs 
my own zipcode vs last four digits of my Antarctican passport, etc.
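
An alias layer of that sort might look like the Python sketch below. The EXTRA_ALIASES table and resolve_label function are hypothetical names for this example, and they sit on top of Python's codec registry rather than Perl's Encode. Note that "ANSI" on Windows strictly means the system's active code page, which is CP1252 only on Western-locale systems, so treating it as CP1252 is a convenience assumption.

```python
import codecs

# Hypothetical extra labels, checked before the standard codec registry.
EXTRA_ALIASES = {
    "ansi": "cp1252",       # assumption: "ANSI" means the Western code page
    "win-1252": "cp1252",
    "win1252": "cp1252",
}

def resolve_label(label: str) -> str:
    """Map a declared encoding label to a canonical codec name."""
    key = label.strip().lower()
    key = EXTRA_ALIASES.get(key, key)
    # codecs.lookup raises LookupError for labels it does not know.
    return codecs.lookup(key).name
```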



Incidentally, you presumably might want to expand the 
%Latin1Code_to_fallback table in Pod::Escapes.
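
For illustration only, an expanded fallback table could cover the characters CP1252 adds in 0x80-0x9F along these lines. This is a Python sketch with made-up fallback strings, not the actual contents or layout of the Pod::Escapes table; CP1252_FALLBACK and to_ascii are hypothetical names.

```python
# Hypothetical ASCII fallbacks, keyed by Unicode code point, for some of
# the characters that CP1252 defines in the 0x80-0x9F byte range.
CP1252_FALLBACK = {
    0x20AC: "EUR",    # euro sign (byte 0x80)
    0x2026: "...",    # horizontal ellipsis (byte 0x85)
    0x2018: "'",      # left single quote (byte 0x91)
    0x2019: "'",      # right single quote (byte 0x92)
    0x201C: '"',      # left double quote (byte 0x93)
    0x201D: '"',      # right double quote (byte 0x94)
    0x2013: "-",      # en dash (byte 0x96)
    0x2014: "--",     # em dash (byte 0x97)
    0x2122: "(TM)",   # trade mark sign (byte 0x99)
}

def to_ascii(text: str) -> str:
    """Replace non-ASCII characters with a fallback, or '?' if none is known."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append(CP1252_FALLBACK.get(ord(ch), "?"))
    return "".join(out)
```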


(...which reminds me to push out some more versions of Unidecode, 
notably one that covers the symbol for the now very eventful ruble.)



Now, there's two issues that may or may not be already seen as separate:
* assuming that input is CP1252 in the absence of any encoding being 
declared

* assuming that input is CP1252 if the declared encoding is Latin-1
I suggest doing both (like HTML5)-- but at least the first definitely!
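
The two cases can be kept separate in code and still share one answer. Here is a minimal Python sketch, assuming a hypothetical effective_encoding function that receives the declared label, or None when the file has no =encoding line:

```python
import codecs

def effective_encoding(declared=None):
    """Pick the codec to decode with, given the declared label (or None)."""
    # Case 1: nothing declared, so assume CP1252.
    if declared is None:
        return "cp1252"

    # Canonicalize the label; raises LookupError if it is unknown.
    name = codecs.lookup(declared).name

    # Case 2: declared Latin-1 is construed as CP1252, as HTML5 does.
    if name == "iso8859-1":
        return "cp1252"

    # Anything else is taken at its word.
    return name
```

Keeping the branches apart makes it easy to adopt only the first rule, if the second turns out to be contentious.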


If anyone wants extreme SM, maybe throw a note in WARNINGS about "I 
expected this to be in Latin-1 but it looks like maybe you should 
probably have a '=encoding win1252' line."
But that seems a case of pointless and even onerous obtuseness, 
instead of unproblematic DWIM.  I think.
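
For anyone who does want that warning, the check itself is small. This is a Python sketch under stated assumptions: the function names and the wording of the message are invented for the example, and only three spellings of the Latin-1 label are recognized.

```python
def c1_byte_offsets(data: bytes):
    """Return offsets of bytes in 0x80-0x9F, where Latin-1 and CP1252 differ."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

def latin1_warning(data: bytes, declared: str):
    """Return a warning string if Latin-1 was declared but CP1252 bytes appear."""
    if declared.strip().lower() not in ("latin-1", "latin1", "iso-8859-1"):
        return None
    offsets = c1_byte_offsets(data)
    if not offsets:
        return None
    return (
        "declared Latin-1, but byte 0x%02X at offset %d suggests "
        "'=encoding win1252'" % (data[offsets[0]], offsets[0])
    )
```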




> I’ve discussed this with Sean Burke in the last couple years, and IIRC he said 
> he probably should have assumed CP1252 instead of Latin-1 when he wrote it.


True enough!


> But not if there are flaws with the plan. Thoughts? Should we make this change? 
> Seems like a win overall to me, but I miss details all the time. Let me know 
> your thoughts.



As to possible flaws, I see two that are on the very edge of remote 
possibility.

But, for sake of completeness, I'll note:

* I think using characters 0x80-0x9F might just conceivably screw up 
some crazy text editors' "what encoding is this?" guesswork-- with 
what consequences I don't know.
But, ya know, as Paul F. Tompkins says: "We are living in a year with 
a TWO IN FRONT OF IT!", so any editor that silently guesses that way, 
and somehow silently makes bad things happen, should have already been 
pushed out an airlock at least a decade ago.


* And, speaking of heuristics: I think the recognition heuristics in 
Unix's file(1) might... remotely, conceivably... change file(1)'s 
opinion of what a pure-Pod input file is, from "yes" to "no", if it 
construes a file that has 0x80-0x9F but also has "=encoding latin-1" 
as a paradox that means something not-Pod.  Hypothetically.
But that is far beyond any sense that file(1) can be expected to 
*reliably* have (or maybe can even express in its recognition rules).
Already file(1) is just catastrophically dumb at anything other than 
answering things like "is this extensionless file a GIF?", because 
beyond that, it already guesses wrong more often than right.


I've just now run it on Pod/Simple.pod and it said
C source, ASCII text
Boioiooing.

And I've just now run it on a s2763_sjis.pod I had lying around, which 
has two kanji in the first 64 bytes-- and with a =encoding shiftjis 
being the second line in the file!, and file(1) said:
Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line 
terminators


So... Don't overthink why file(1) does what it does--  *it* certainly 
doesn't overthink it.



I hope this message has helped.
REESE'S PIECES OUT.



Re: Assume CP1252

2015-01-10 Thread David E. Wheeler
On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:

 Helleu, Pod pals!
 Short version about Re: Assume CP1252-- I advise: yes, assume CP1252 where 
 technically you were expecting Latin-1.

Thanks for chiming in, Sean.

 I agree completely, go for it!
 
 Yes:
 * assume that input is CP1252 in the absence of any encoding being declared
 * assume that input is CP1252 if the declared encoding is Latin-1
 
 As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to 
 actually mean the superset CP1252) means in practice that everybody wins, and 
 nobody loses, and DWIM prevails yet again.

Right, I vaguely remember you telling me this before. I forgot about #2 (and 
the HTML 5 precedent).

 BTW: I think many people would appreciate having =encoding ansi tolerated 
 as a synonym for =encoding win-1252... because some systems simply call it 
 that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four 
 digits of my Antarctican passport, etc.

ansi == cp1252??

I think Encode determines aliases.

 Incidentally, you presumably might want to expand the %Latin1Code_to_fallback 
 table in Pod::Escapes.

Paging Neil Bowers.

 Now, there's two issues that may or may not be already seen as separate:
 * assuming that input is CP1252 in the absence of any encoding being declared
 * assuming that input is CP1252 if the declared encoding is Latin-1
 I suggest doing both (like HTML5)-- but at least the first definitely!

+1

 If anyone wants extreme SM, maybe a throw a note in WARNINGS about I 
 expected this to be in Latin-1 but it looks like maybe you should probably 
 have a '=encoding win1252' line.
 But that seems a case of pointless and even onerous obtuseness, instead of 
 unproblematic DWIM.  I think.

Meh. I'm thinking, however, of adding a note to the ChangeLog for the next 
release that this change will be in the following release. I’ve already added a 
note that support for Perls < 5.5 will be dropped.

 As to possible flaws, I see two that are on the very edge of remote 
 possibility.
 But, for sake of completeness, I'll note:

Pretty obscure!

 I hope this message has helped.
 REESE'S PIECES OUT.

Thanks again!

Best,

David



