On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote:
> Well, it seems some software producers know what they
> are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
Yes, Python lets you choose your byte encoding from those and a hundred
others. I believe all the codecs are now tested in both directions. It
was not an easy task.
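A minimal sketch of what "both directions" means in practice, using only the codecs already mentioned plus UTF-8. The variable names are mine, chosen for illustration:

```python
# Encode and decode the euro sign with codecs that have a slot for it.
text = "€"
for name in ("cp1252", "mac-roman", "utf-8"):
    data = text.encode(name)          # str -> bytes
    assert data.decode(name) == text  # bytes -> str gives back the original
    print(name, data)

# When a codec has no mapping, the errors= argument decides what happens.
# The default ("strict") raises UnicodeEncodeError; "replace" substitutes '?'.
fallback = text.encode("iso-8859-1", errors="replace")
print(fallback)
```

With errors="replace" the information is lost, so it is only suitable when a lossy result is acceptable.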
As to the examples: Latin-1 dates to 1985 or earlier, and the 1988
version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1
"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. Standards issued by
international standards organizations are relatively fixed. (The Unicode
Consortium will not even correct misspelled character names.) Instead,
new standards with a new number are adopted.
For better or worse, private mappings are more flexible. In its Mac
mapping Apple "replaced the generic currency sign ¤ with the euro sign
€". (See Latin-1 reference.) Great if you use Euros, not so great if you
were using the previous sign for something else.
For Windows cp1252, Microsoft reassigned a code it did not need to the
Euro sign.
https://en.wikipedia.org/wiki/Windows-1252
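To illustrate why such private reassignments cause trouble, here is a small sketch that decodes the two euro bytes from the session above under a different codec than the one that produced them. The variable names are hypothetical:

```python
# The same byte means different things depending on the codec.
win_byte = b"\x80"   # '€' as encoded by cp1252
mac_byte = b"\xdb"   # '€' as encoded by mac-roman

win_as_cp1252 = win_byte.decode("cp1252")      # the euro sign
win_as_latin1 = win_byte.decode("iso-8859-1")  # U+0080, an invisible control character

mac_as_mac = mac_byte.decode("mac-roman")      # the euro sign
mac_as_latin1 = mac_byte.decode("iso-8859-1")  # U+00DB, 'Û'

for label, ch in [("cp1252", win_as_cp1252), ("latin-1", win_as_latin1),
                  ("mac-roman", mac_as_mac), ("latin-1", mac_as_latin1)]:
    print(label, ascii(ch))
```

Nothing in the bytes themselves says which reading is intended; that information has to travel separately, as a charset label.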
"It is very common to mislabel Windows-1252 text with the charset label
ISO-8859-1. A common result was that all the quotes and apostrophes
(produced by "smart quotes" in Microsoft software) were replaced with
question marks or boxes on non-Windows operating systems, making text
difficult to read. Most modern web browsers and e-mail clients treat the
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such
mislabeling. This is now standard behavior in the draft HTML 5
specification, which requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding.[1]"
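The mislabeling described in that quote is easy to reproduce. This is only a sketch with a made-up sample string: it encodes "smart quotes" as Windows-1252 and then decodes the bytes under both labels.

```python
# Text as Microsoft software might write it, with curly quotes.
original = "\u201csmart quotes\u201d"
raw = original.encode("cp1252")    # the quotes become bytes 0x93 and 0x94

right = raw.decode("cp1252")       # correct label: curly quotes come back
wrong = raw.decode("iso-8859-1")   # wrong label: quotes become C1 control characters

print(ascii(right))
print(ascii(wrong))
```

ISO-8859-1 decoding never fails, because every byte maps to some code point, so the error shows up only as unprintable characters where the quotes should be.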
Lots of fun. Too bad Microsoft won't push UTF-8 so we can all
communicate text with much less chance of ambiguity.
--
Terry Jan Reedy