I'm approaching this from the premise that we would like to avoid
needless surprises for users not versed in text encoding. I did a simple
experiment with notepad on Windows 7 as if a naïve user. If I write the
one-line program:
    print("Hello world.") # by Jeff
It runs, no surprise.
We may legitimately encounter Unicode in string literals and comments.
If I write:
    print("j't'kif Anaïs!") # par Hervé
and try to save it, notepad tells me this file "contains characters in
Unicode format which will be lost if you save this as an ANSI encoded
text file." To keep the Unicode information I should cancel and choose a
Unicode option. In the "Save as" dialogue the default encoding is ANSI.
The second option, "Unicode", is clearly right, as the warning said
"Unicode" 3 times and I don't know what big-endian or UTF-8 mean. Good,
that worked. Closed and reopened, it looks exactly as I typed it.
But the bytes I actually wrote on disk consist of a BOM and UTF-16-LE.
And running it I get:
      File "bonjour.py", line 1
    SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on
    line 1, but no encoding declared; see
    http://python.org/dev/peps/pep-0263/ for details
If I take the hint here and save as UTF-8, then it works, including
printing the accent. Inspection of the bytes shows it starts with a
UTF-8 BOM.
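
To make the byte-level difference concrete, here is a minimal sketch
(Python 3, standard library only) that rebuilds roughly what notepad
wrote for each option and asks tokenize.detect_encoding what it makes of
the result. The names SOURCE, notepad_bytes and detected_encoding are
mine, for illustration only; I am also assuming notepad's "Unicode"
option means a BOM followed by UTF-16-LE, as observed above, and I
ignore its CR-LF line endings.

```python
import codecs
import io
import tokenize

# The one-line program from the experiment (hypothetical name).
SOURCE = 'print("j\'t\'kif Anaïs!") # par Hervé\n'


def notepad_bytes(text, option):
    """Approximate the bytes notepad writes for a "Save as" encoding option."""
    if option == "Unicode":
        # Assumption: a BOM followed by UTF-16-LE, as seen on disk.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if option == "UTF-8":
        # UTF-8 preceded by the three-byte UTF-8 BOM.
        return codecs.BOM_UTF8 + text.encode("utf-8")
    raise ValueError("unknown option: %r" % option)


def detected_encoding(data):
    """Return the encoding the tokenizer infers, or None if it rejects the bytes."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError:
        return None
    return encoding


for option in ("Unicode", "UTF-8"):
    data = notepad_bytes(SOURCE, option)
    print(option, data[:4], detected_encoding(data))
```

The "Unicode" bytes begin with b'\xff\xfe' and are rejected (None here),
while the "UTF-8" bytes begin with b'\xef\xbb\xbf' and are reported as
utf-8-sig, which matches what I saw when running the files.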
In Jython I get the same results: it chokes on UTF-16 but, saved as
UTF-8, it works. I just have to make sure the string is a Unicode
literal (u"...") if I want it to print correctly, as we're at 2.7.
Jython has a checkered past
with encodings, but tries to do exactly the same as CPython 2.7.x.
Now, a fact I haven't mentioned is that my machine was localised to
simplified Chinese (to diagnose some bug) during this test. If I
re-localise to my usual English (UK), I do not get the guidance from
notepad: instead it quietly saves as cp1252 (Windows' Latin-1 variant), perhaps because
I'm in Western Europe. Python baulks at this, at the first accented
character. If I save from notepad as Unicode or UTF-8 the results are as
before, including the BOM.
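
The cp1252 case can be sketched the same way. This is an illustration
under assumptions, not a statement of what notepad does internally: I
simply encode the line as cp1252 and ask tokenize.detect_encoding
(Python 3) about it, with and without a PEP 263 declaration of the kind
the error message points to. LINE and tokenizer_encoding are
hypothetical names.

```python
import io
import tokenize

# The same one-line program (hypothetical name).
LINE = 'print("j\'t\'kif Anaïs!") # par Hervé\n'


def tokenizer_encoding(data):
    """Return the encoding the tokenizer infers, or None if it rejects the bytes."""
    try:
        return tokenize.detect_encoding(io.BytesIO(data).readline)[0]
    except SyntaxError:
        return None


# What an "ANSI" save produces on a Western European machine (assumed cp1252).
ansi = LINE.encode("cp1252")

# The same bytes preceded by a PEP 263 encoding declaration.
declared = b"# -*- coding: cp1252 -*-\n" + ansi

print(tokenizer_encoding(ansi))      # None: not valid UTF-8, nothing declared
print(tokenizer_encoding(declared))  # cp1252
print(hex(ansi[ansi.index(b"\xef")]))  # the byte for "ï", the first accented character
```

Without a declaration the single byte 0xEF for "ï" is not valid UTF-8,
so the file is rejected at exactly the first accented character.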
In some circumstances, then, the natural result of using notepad and not
sticking to ASCII may be UTF-16-LE with a BOM, or Latin-1 depending on
localisation, it seems. The Python error message provides a clue what a
user should do, but they would need some background, a helpful teacher,
or the Internet to sort it out.
Jeff Allen
On 15/11/2015 07:23, Stephen J. Turnbull wrote:
Steve Dower writes:
> Saying [UTF-16] is rarely used is rather exposing your own
> unawareness though - it could arguably be the most commonly used
> encoding (depending on how you define "used").
Because we're discussing the storage of .py files, the relevant
definition is the one used by the Unicode Standard, of course: a
text/plain stream intended to be manipulated by any conformant Unicode
processor that claims to handle text/plain. File formats with in-band
formatting codes and allowing embedded non-text content like Word, or
operating system or stdlib APIs, don't count. Nor have I seen UTF-16
used in email or HTML since the unregretted days of Win2k betas[1]
(but I don't frequent Windows- or Java-oriented sites, so I have to
admit my experience is limited in a possibly relevant way).
In Japan my impression is that modern versions of Windows have
Memopad[sic] configured to emit UTF-8-with-signature by default for
new files, and if not, the abomination known as Shift JIS (I'm not
sure if that is a user or OEM option, though). Never a widechar
encoding (after all, the whole point of Shift JIS was to use an 8-bit
encoding for the katakana syllabary to save space or bandwidth).
I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
programs, they probably already know how to convert them to UTF-8. As
somebody already suggested, this can be delegated to the py.exe
launcher, if necessary, AFAICS.
I don't see any good reason for allowing non-ASCII-compatible
encodings in the reference CPython interpreter.
However, having mentioned Windows and Java, I have to wonder about
IronPython and Jython, respectively. Having never lived in either of
those environments, I don't know what text encoding their users might
prefer (or even occasionally encounter) in Python program source.
Steve
Footnotes:
[1] The version of Outlook Express shipped with them would emit
"HTML" mail with ASCII tags and UTF-16-encoded text (even if it was
encodable in pure ASCII). No, it wasn't spam, either, so it probably
really was Outlook Express as it claimed to be in one of the headers.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:https://mail.python.org/mailman/options/python-dev/ja.py%40farowl.co.uk