Helleu, Pod pals!
Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.

 ~~

Long version:

I don't normally pipe up about (or keep up with anything about) Pod stuff, because it's y'all's language now-- but since an issue of my original intent has come up, and it got shunted into my normal inbox, I'll jump in:

On 01/05/2015 10:58 PM, David E. Wheeler wrote:

> [...] Pod Peeps:
> if the first highbit byte sequence in the file seems valid as a UTF-8
> sequence, or otherwise as Latin-1.
> [...]  I suggest we switch from Latin-1 to CP1252. [...]

I agree completely, go for it!

Yes:
* assume that input is CP1252 in the absence of any encoding being declared
* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.
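To make the two rules concrete, here's a rough sketch in Python (not Perl, and not what Pod::Simple actually does internally). The names guess_encoding and decode_pod are made up for illustration. One assumption worth flagging: Python's own cp1252 codec rejects the five bytes that CP1252 leaves unassigned, so the sketch passes those through as the matching control characters, which is roughly what the W3C encoding spec does.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not part of any Pod module.
_HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")
_LATIN1_LABELS = {"latin-1", "latin1", "iso-8859-1"}
# Bytes that CP1252 leaves unassigned (Python's codec raises on them).
_CP1252_UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def guess_encoding(raw: bytes, declared: Optional[str] = None) -> str:
    """Return a Python codec name for Pod source bytes."""
    if declared is not None:
        label = declared.strip().lower()
        # Rule 2: a declared Latin-1 is construed as CP1252.
        return "cp1252" if label in _LATIN1_LABELS else label
    # Rule 1: nothing declared. If the first run of high-bit bytes is
    # valid UTF-8, call it UTF-8; otherwise assume CP1252.
    match = _HIGHBIT_RUN.search(raw)
    if match is not None:
        try:
            match.group().decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
    return "cp1252"


def decode_pod(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode Pod source bytes using the guess above."""
    encoding = guess_encoding(raw, declared)
    if encoding != "cp1252":
        # An unknown declared label raises LookupError here.
        return raw.decode(encoding)
    # Assumption: unassigned bytes map to the same-numbered code point.
    return "".join(
        chr(b) if b in _CP1252_UNASSIGNED else bytes([b]).decode("cp1252")
        for b in raw
    )
```

So a file that declares "=encoding latin-1" but contains the bytes 0x93 and 0x94 comes out with proper curly quotes instead of C1 control characters.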

Moreover, this construal of Latin-1 as CP1252 has significant precedent:

«Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the draft HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.»

And it obeys Postel's law:
"Be conservative in what you do; be liberal in what you accept from others."

And...
  http://www.w3.org/TR/encoding/#names-and-labels
even seems to tolerate more things, to a point, if I'm reading it right. Dunno. On this point, it's up to you folks.


BTW: I think many people would appreciate having "=encoding ansi" tolerated as a synonym for "=encoding win-1252"... because some systems simply call it that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four digits of my Antarctican passport, etc.
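A tiny sketch of what tolerating that synonym could look like, again in Python and purely illustrative. The alias table and the function name normalize_encoding_line are assumptions of mine, not anything in Pod::Simple, and the set of spellings listed is just a guess at what people might type.

```python
from typing import Optional

# Hypothetical alias table: every spelling here resolves to CP1252.
ENCODING_ALIASES = {
    "ansi": "cp1252",
    "win-1252": "cp1252",
    "win1252": "cp1252",
    "windows-1252": "cp1252",
    "cp1252": "cp1252",
}


def normalize_encoding_line(line: str) -> Optional[str]:
    """Return the canonical label from an '=encoding X' line, else None."""
    parts = line.split()
    if len(parts) != 2 or parts[0] != "=encoding":
        return None
    label = parts[1].lower()
    # Labels not in the table pass through unchanged.
    return ENCODING_ALIASES.get(label, label)
```

With that, "=encoding ansi" and "=encoding win-1252" both come back as "cp1252", and nobody has to remember the number.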


Incidentally, you presumably might want to expand the %Latin1Code_to_fallback table in Pod::Escapes.
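For a sense of how big that expansion is: CP1252 assigns 27 characters in the 0x80-0x9F range. Below is a Python sketch of a fallback table for them, keyed by Unicode code point. The ASCII fallback strings are only my suggestions (nothing here is taken from Pod::Escapes), and the names CP1252_FALLBACK and ascii_fallback are hypothetical.

```python
# Hypothetical ASCII fallbacks for the characters CP1252 puts in 0x80-0x9F.
# Keys are Unicode code points; the fallback strings are suggestions only.
CP1252_FALLBACK = {
    0x20AC: "EUR",   # euro sign
    0x201A: ",",     # single low-9 quotation mark
    0x0192: "f",     # f with hook
    0x201E: '"',     # double low-9 quotation mark
    0x2026: "...",   # horizontal ellipsis
    0x2020: "+",     # dagger
    0x2021: "++",    # double dagger
    0x02C6: "^",     # modifier circumflex
    0x2030: "o/oo",  # per mille sign
    0x0160: "S",     # S with caron
    0x2039: "<",     # single left angle quotation mark
    0x0152: "OE",    # OE ligature
    0x017D: "Z",     # Z with caron
    0x2018: "'",     # left single quotation mark
    0x2019: "'",     # right single quotation mark
    0x201C: '"',     # left double quotation mark
    0x201D: '"',     # right double quotation mark
    0x2022: "*",     # bullet
    0x2013: "-",     # en dash
    0x2014: "--",    # em dash
    0x02DC: "~",     # small tilde
    0x2122: "(tm)",  # trade mark sign
    0x0161: "s",     # s with caron
    0x203A: ">",     # single right angle quotation mark
    0x0153: "oe",    # oe ligature
    0x017E: "z",     # z with caron
    0x0178: "Y",     # Y with diaeresis
}


def ascii_fallback(text: str) -> str:
    """Replace CP1252-only characters with their ASCII fallbacks."""
    return "".join(CP1252_FALLBACK.get(ord(ch), ch) for ch in text)
```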

(...which reminds me to push out some more versions of Unidecode, notably one that covers the symbol for the now very eventful ruble.)


Now, there's two issues that may or may not be already seen as separate:
* assuming that input is CP1252 in the absence of any encoding being declared
* assuming that input is CP1252 if the declared encoding is Latin-1
I suggest doing both (like HTML5)-- but at least the first definitely!


If anyone wants extreme S&M, maybe throw a note in WARNINGS about "I expected this to be in Latin-1 but it looks like maybe you should probably have a '=encoding win1252' line." But that seems a case of pointless and even onerous obtuseness, instead of unproblematic DWIM. I think.


> I’ve discussed this with Sean Burke in the last couple years, and IIRC he said
> he probably should have assumed CP1252 instead of Latin-1 when he wrote it.

True enough!

> But not if there are flaws with the plan. Thoughts? Should we make this change?
> Seems like a win overall to me, but I miss details all the time. Let me know
> your thoughts.


As to possible flaws, I see two that are on the very edge of remote possibility.
But, for sake of completeness, I'll note:

* I think using characters 0x80-0x9F might just conceivably screw up some crazy text editors' "what encoding is this?" guesswork-- with what consequences I don't know. But, ya know, as Paul F. Tompkins says: "We are living in a year with a TWO IN FRONT OF IT!", so any editor that silently guesses that way, and somehow silently makes bad things happen, should have already been pushed out an airlock at least a decade ago.

* And, speaking of heuristics: I think the recognition heuristics in Unix's file(1) might... remotely, conceivably... change file(1)'s opinion of what a pure-Pod input file is, from yes to no, if it construes a file that has 0x80-0x9F but also has "=encoding latin-1" as a paradox that means something not-Pod. Hypothetically. But that is far beyond any sense that file(1) can be expected to *reliably* have (or maybe can even express in its recognition rules). Already file(1) is just catastrophically dumb at anything other than answering things like "is this extensionless file a GIF?", because beyond that, it already guesses wrong more often than right.

I've just now run it on Pod/Simple.pod and it said
"C source, ASCII text"
Boioiooing.

And I've just now run it on a s2763_sjis.pod I had lying around, which has two kanji in the first 64 bytes-- and with a "=encoding shiftjis" being the second line in the file!, and file(1) said: "Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line terminators"

So... Don't overthink why file(1) does what it does-- *it* certainly doesn't overthink it.


I hope this message has helped.
REESE'S PIECES OUT.
