I'm approaching this from the premise that we would like to avoid needless surprises for users not versed in text encoding. I ran a simple experiment with Notepad on Windows 7, acting as a naïve user would. If I write the one-line program:

print("Hello world.") # by Jeff

It runs, no surprise.

We may legitimately encounter Unicode in string literals and comments. If I write:

print("j't'kif Anaïs!") # par Hervé

and try to save it, Notepad tells me this file "contains characters in Unicode format which will be lost if you save this as an ANSI encoded text file." To keep the Unicode information I should cancel and choose a Unicode option. In the "Save as" dialogue the default encoding is ANSI. The second option, "Unicode", is clearly right, as the warning said "Unicode" three times and I don't know what big-endian or UTF-8 mean. Good, that worked. Closed and reopened, the file looks exactly as I typed it.

But the bytes actually written to disk are a BOM followed by UTF-16-LE. Running it, I get:
  File "bonjour.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
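
To make the failure concrete, here is a minimal sketch, assuming Python 3, of the bytes that the "Unicode" option appears to produce. The variable names are mine, and building the bytes by hand is an assumption standing in for the real file on disk.

```python
import codecs

source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# Assumed equivalent of Notepad's "Unicode" option:
# a byte order mark followed by the text in UTF-16-LE.
data = codecs.BOM_UTF16_LE + source.encode("utf-16-le")

first_bytes = data[:4]  # BOM (ff fe), then 'p' as the two bytes 70 00

# The same bytes are not valid UTF-8, which is what the interpreter expects.
try:
    data.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The very first byte, 0xff, can never occur in UTF-8, which is consistent with the error being reported on line 1.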

If I take the hint here and save as UTF-8, then it works, including printing the accent. Inspection of the bytes shows it starts with a UTF-8 BOM.
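
The UTF-8 case can be sketched the same way, again assuming Python 3. The standard "utf-8-sig" codec writes the same three-byte signature that I found at the start of the file; whether Notepad's output matches it byte for byte beyond the BOM is an assumption.

```python
import codecs

source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# "utf-8-sig" writes the three-byte UTF-8 signature before the text.
data = source.encode("utf-8-sig")
has_bom = data.startswith(codecs.BOM_UTF8)  # BOM_UTF8 is ef bb bf

# Decoding with "utf-8-sig" strips the signature again ...
round_trip = data.decode("utf-8-sig")

# ... whereas plain "utf-8" leaves U+FEFF as the first character.
plain = data.decode("utf-8")
```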

In Jython I get the same results (choking on UTF-16), but saved as UTF-8, it works. I just have to make sure the string is a Unicode literal (u"...") if I want it to print correctly, as we're at 2.7. Jython has a checkered past with encodings, but tries to do exactly the same as CPython 2.7.x.

Now, a fact I haven't mentioned is that my machine was localised to simplified Chinese (to diagnose some bug) during this test. If I re-localise to my usual English (UK), I do not get the guidance from Notepad: instead it quietly saves as cp1252 (the Windows code page close to Latin-1), perhaps because I'm in Western Europe. Python baulks at this, at the first accented character. If I save from Notepad as Unicode or UTF-8 the results are as before, including the BOM.
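
A small sketch, assuming Python 3 and assuming the quiet "ANSI" save is equivalent to encoding with cp1252, shows why the failure lands on the first accented character. The names failed_at and first_accent are mine.

```python
source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# cp1252 stores each accented letter as a single byte above 0x7f.
data = source.encode("cp1252")

try:
    data.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # offset of the first undecodable byte

# Everything before the first accent is ASCII, one byte per character,
# so the byte offset and the character index coincide.
first_accent = source.index("ï")
```

Here 0xef looks like the lead byte of a three-byte UTF-8 sequence, but the ASCII letter that follows is not a continuation byte, so decoding stops there.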

In some circumstances, then, it seems the natural result of using Notepad and not sticking to ASCII may be either UTF-16-LE with a BOM or Latin-1 (cp1252), depending on localisation. The Python error message provides a clue as to what a user should do, but they would need some background, a helpful teacher, or the Internet to sort it out.
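
For anyone wanting to check what an editor actually wrote, a BOM check is the easy part. The helper below is a hypothetical sketch of mine (sniff_bom is not a stdlib function), assuming Python 3. It cannot tell cp1252 from UTF-8 without a signature, since neither carries one.

```python
import codecs

# Longer signatures first: the UTF-32-LE BOM begins with the UTF-16-LE one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None

text = 'print("é")\n'
from_unicode_option = sniff_bom(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
from_utf8_option = sniff_bom(text.encode("utf-8-sig"))
from_ansi_option = sniff_bom(text.encode("cp1252"))
```

The codec names returned are ones that consume the BOM on decoding, so the result can be passed straight to bytes.decode().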

Jeff Allen

On 15/11/2015 07:23, Stephen J. Turnbull wrote:
Steve Dower writes:

  > Saying [UTF-16] is rarely used is rather exposing your own
  > unawareness though - it could arguably be the most commonly used
  > encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant
definition is the one used by the Unicode Standard, of course: a
text/plain stream intended to be manipulated by any conformant Unicode
processor that claims to handle text/plain.  File formats with in-band
formatting codes and allowing embedded non-text content like Word, or
operating system or stdlib APIs, don't count.  Nor have I seen UTF-16
used in email or HTML since the unregretted days of Win2k betas[1]
(but I don't frequent Windows- or Java-oriented sites, so I have to
admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have
Memopad[sic] configured to emit UTF-8-with-signature by default for
new files, and if not, the abomination known as Shift JIS (I'm not
sure if that is a user or OEM option, though).  Never a widechar
encoding (after all, the whole point of Shift JIS was to use an 8-bit
encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
programs, they probably already know how to convert them to UTF-8.  As
somebody already suggested, this can be delegated to the py.exe
launcher, if necessary, AFAICS.
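
As an illustration of how small that conversion is, here is a minimal sketch, assuming Python 3 and a UTF-16 file that starts with a BOM. The function utf16_to_utf8 is a hypothetical name; nothing here describes what py.exe actually does.

```python
def utf16_to_utf8(data):
    """Re-encode UTF-16 source bytes (with BOM) as plain UTF-8."""
    # The "utf-16" codec reads the BOM, picks the byte order and drops it.
    return data.decode("utf-16").encode("utf-8")

original = 'print("j\'t\'kif Anaïs!") # par Hervé\n'
converted = utf16_to_utf8(original.encode("utf-16"))

# compile() accepts bytes and assumes UTF-8 when no encoding is declared.
code = compile(converted, "bonjour.py", "exec")
```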

I don't see any good reason for allowing non-ASCII-compatible
encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about
IronPython and Jython, respectively.  Having never lived in either of
those environments, I don't know what text encoding their users might
prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes:
[1]  The version of Outlook Express shipped with them would emit
"HTML" mail with ASCII tags and UTF-8-encoded text (even if it was
encodable in pure ASCII).  No, it wasn't spam, either, so it probably
really was Outlook Express as it claimed to be in one of the headers.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:https://mail.python.org/mailman/options/python-dev/ja.py%40farowl.co.uk

