Assume CP1252

David E. Wheeler Mon, 05 Jan 2015 21:59:33 -0800

Pod Peeps:

perlpodspec says:


   *   Since Perl recognizes a Unicode Byte Order Mark at the start of files
       as signaling that the file is Unicode encoded as in UTF-16 (whether
       big-endian or little-endian) or UTF-8, Pod parsers should do the same.
       Otherwise, the character encoding should be understood as being UTF-8
       if the first highbit byte sequence in the file seems valid as a UTF-8
       sequence, or otherwise as Latin-1.
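
A minimal sketch of the rule quoted above, written in Python rather than Perl purely for illustration. The function name `guess_pod_encoding` and its `fallback` parameter are hypothetical and not taken from any Pod parser. Reading "first highbit byte sequence" as the first run of bytes >= 0x80 is an assumption; a real parser may interpret it differently.

```python
import codecs
import re

# First run of consecutive high-bit bytes (0x80-0xFF) in the input.
_HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")


def guess_pod_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess a codec name for raw Pod bytes, following the quoted rule.

    `fallback` is the encoding assumed when the first high-bit run is
    not valid UTF-8; the spec text says Latin-1.
    """
    # 1. A byte order mark decides the question outright.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. Pure ASCII decodes the same way under every candidate.
    match = _HIGHBIT_RUN.search(data)
    if match is None:
        return fallback

    # 3. Only the first high-bit run is tested for UTF-8 validity.
    try:
        match.group().decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

In this sketch the only thing the proposal changes is the value passed as `fallback`: `"cp1252"` instead of `"latin-1"`.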

I suggest we switch from Latin-1 to CP1252. The reasons are:

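* CP1252 is effectively a superset of Latin-1: the two differ only in the
bytes 0x80-0x9F, where Latin-1 has the rarely used C1 control characters and
CP1252 has printable characters.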

* Characters that are valid in CP1252 but not in Latin-1 sometimes appear in
Pod, typically curly quotes, em dashes, or similar characters pasted from
Word. The usual suspects are listed in this table:

 
http://search.cpan.org/dist/Encode-ZapCP1252/lib/Encode/ZapCP1252.pm#Conversion_Table

* By assuming CP1252 instead of Latin-1, such characters would be properly
decoded when parsing Pod, so they would come out right in the resulting
output. Documents that really are Latin-1 should be unaffected.
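
To make the difference concrete, here is a small Python sketch (illustrative only, not code from any Pod parser) that decodes the same bytes both ways. The helper `decode_cp1252_lenient` is hypothetical. It is there because Python's `cp1252` codec rejects the five bytes that CP1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), and passing them through as Latin-1 is just one possible policy.

```python
# Bytes as Word on Windows typically writes them: curly quotes, an em dash,
# and an e-acute that is the same byte in both encodings.
raw = b"\x93Pod\x94 \x97 caf\xe9"

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# Latin-1 turns 0x93, 0x94 and 0x97 into invisible C1 control characters.
assert as_latin1[0] == "\u0093"
# CP1252 turns them into the punctuation the author intended.
assert as_cp1252 == "\u201cPod\u201d \u2014 caf\u00e9"
# Bytes outside 0x80-0x9F decode identically under both.
assert as_latin1[-1] == as_cp1252[-1] == "\u00e9"


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode as CP1252, passing the unassigned bytes through as Latin-1.

    Assumption: falling back to Latin-1 for those bytes is acceptable;
    a real parser might warn or substitute U+FFFD instead.
    """
    out = []
    for value in data:
        byte = bytes([value])
        try:
            out.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            out.append(byte.decode("latin-1"))
    return "".join(out)
```

With a lenient decoder like this, every byte outside 0x80-0x9F still decodes exactly as it does under Latin-1, which is the "no side effects" claim in code form.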

So I think we would get better output for documents that include
Windows-specific characters, without side effects; a little more text would
simply come out right. I’ve discussed this with Sean Burke over the last
couple of years, and IIRC he said he probably should have assumed CP1252
instead of Latin-1 when he wrote it. It’s coming up again now because Karl
Williamson has recently been improving the EBCDIC support, which lives in the
same bit of code (it’s all about encodings, you know?), so this would be a
natural time and place to do it.

But not if there are flaws in the plan. Should we make this change? It seems
like a win overall to me, but I miss details all the time, so let me know
your thoughts.

Best,

David
