Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

Jeff Allen Sun, 15 Nov 2015 07:21:01 -0800

I'm approaching this from the premise that we would like to avoid needless surprises for users not versed in text encoding. I did a simple experiment with notepad on Windows 7 as if a naïve user. If I write the one-line program:


print("Hello world.") # by Jeff


It runs, no surprise.

We may legitimately encounter Unicode in string literals and comments. If I write:


print("j't'kif Anaïs!") # par Hervé

and try to save it, notepad tells me this file "contains characters in Unicode format which will be lost if you save this as an ANSI encoded text file." To keep the Unicode information I should cancel and choose a Unicode option. In the "Save as" dialogue the default encoding is ANSI. The second option "Unicode" is clearly right, as the warning said "Unicode" 3 times and I don't know what big-endian or UTF-8 mean. Good, that worked. Closed and reopened, it looks exactly as I typed it.

But the bytes I actually wrote on disk consist of a BOM and UTF-16-LE. And running it I get:

  File "bonjour.py", line 1

SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
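A minimal Python 3 sketch of why the message complains about '\xff'. It assumes, as the inspection of the bytes suggests, that notepad's "Unicode" option writes a BOM followed by UTF-16-LE; the variable names are only for illustration.

```python
import codecs

source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# Assumption: this mimics what notepad's "Unicode" option puts on disk,
# i.e. the UTF-16-LE byte order mark followed by UTF-16-LE text.
data = codecs.BOM_UTF16_LE + source.encode("utf-16-le")

# The file begins with the two BOM bytes 0xff 0xfe.
print(data[:4])

# 0xff can never start a UTF-8 sequence, so reading the file
# as UTF-8 fails on the very first byte.
try:
    data.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(utf8_error)
```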

If I take the hint here and save as UTF-8, then it works, including printing the accent. Inspection of the bytes shows it starts with a UTF-8 BOM.
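The UTF-8 case can be reproduced without notepad. This is a sketch for CPython 3, assuming the file is simply the UTF-8 BOM followed by UTF-8 text; it compiles the bytes directly and captures the output rather than relying on the console encoding.

```python
import codecs
import contextlib
import io

source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# Assumption: notepad's "UTF-8" option writes the 3-byte BOM, then UTF-8.
data = codecs.BOM_UTF8 + source.encode("utf-8")

# compile() accepts bytes and recognises the UTF-8 BOM, so no
# encoding declaration is needed.
code = compile(data, "bonjour.py", "exec")

# Run it, capturing stdout so the accent survives any console settings.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(code, {})
printed = buf.getvalue()

# The "utf-8-sig" codec strips the BOM when reading the text back.
text = data.decode("utf-8-sig")
```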

In Jython I get the same results (choking on UTF-16), but saved as UTF-8, it works. I just have to make sure that's a Unicode constant if I want it to print correctly, as we're at 2.7. Jython has a checkered past with encodings, but tries to do exactly the same as CPython 2.7.x.

Now, a fact I haven't mentioned is that my machine was localised to simplified Chinese (to diagnose some bug) during this test. If I re-localise to my usual English (UK), I do not get the guidance from notepad: instead it quietly saves as Latin-1 (cp1252), perhaps because I'm in Western Europe. Python baulks at this, at the first accented character. If I save from notepad as Unicode or UTF-8 the results are as before, including the BOM.
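A sketch of the cp1252 case under Python 3, assuming the "ANSI" save is plain cp1252 with no BOM. It shows where UTF-8 decoding stops, and that a PEP 263 declaration (added here by hand, not something notepad does) lets the same bytes compile.

```python
source = 'print("j\'t\'kif Anaïs!") # par Hervé\n'

# Assumption: the quiet "ANSI" save on an English (UK) machine is cp1252.
data = source.encode("cp1252")

# In cp1252 the letter ï is the single byte 0xef, which UTF-8 treats as
# the start of a multi-byte sequence; the following "s" cannot continue it.
try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
print(error)

# With an explicit PEP 263 declaration on the first line, the same
# bytes are accepted.
declared = b"# -*- coding: cp1252 -*-\n" + data
code = compile(declared, "bonjour.py", "exec")
```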

In some circumstances, then, the natural result of using notepad and not sticking to ASCII may be UTF-16-LE with a BOM, or Latin-1 depending on localisation, it seems. The Python error message provides a clue what a user should do, but they would need some background, a helpful teacher, or the Internet to sort it out.


Jeff Allen

On 15/11/2015 07:23, Stephen J. Turnbull wrote:

Steve Dower writes:

  > Saying [UTF-16] is rarely used is rather exposing your own
  > unawareness though - it could arguably be the most commonly used
  > encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant
definition is the one used by the Unicode Standard, of course: a
text/plain stream intended to be manipulated by any conformant Unicode
processor that claims to handle text/plain.  File formats with in-band
formatting codes and allowing embedded non-text content like Word, or
operating system or stdlib APIs, don't count.  Nor have I seen UTF-16
used in email or HTML since the unregretted days of Win2k betas[1]
(but I don't frequent Windows- or Java-oriented sites, so I have to
admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have
Memopad[sic] configured to emit UTF-8-with-signature by default for
new files, and if not, the abomination known as Shift JIS (I'm not
sure if that is a user or OEM option, though).  Never a widechar
encoding (after all, the whole point of Shift JIS was to use an 8-bit
encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
programs, they probably already know how to convert them to UTF-8.  As
somebody already suggested, this can be delegated to the py.exe
launcher, if necessary, AFAICS.
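The conversion mentioned above might look something like the following sketch. The helper name to_utf8 and the table _BOMS are hypothetical, invented for this illustration; nothing like this is claimed to exist in py.exe. It only sniffs a leading BOM and falls back to a caller-supplied encoding otherwise.

```python
import codecs

# Hypothetical lookup table. UTF-32 comes first because the UTF-32-LE
# BOM (ff fe 00 00) begins with the UTF-16-LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_utf8(data, fallback="utf-8"):
    """Re-encode source bytes as BOM-less UTF-8 (illustrative helper).

    The encoding is guessed from a leading BOM only; without one,
    the bytes are assumed to be in the `fallback` encoding.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM while decoding.
            return data.decode(encoding).encode("utf-8")
    return data.decode(fallback).encode("utf-8")


sample = 'print("Anaïs")\n'
utf16_file = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
print(to_utf8(utf16_file))
```

A UTF-16-LE file whose first character is U+0000 would be mistaken for UTF-32-LE by this check, which seems an acceptable risk for program source.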

I don't see any good reason for allowing non-ASCII-compatible
encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about
IronPython and Jython, respectively.  Having never lived in either of
those environments, I don't know what text encoding their users might
prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes:
[1]  The version of Outlook Express shipped with them would emit
"HTML" mail with ASCII tags and UTF-16-encoded text (even if it was
encodable in pure ASCII).  No, it wasn't spam, either, so it probably
really was Outlook Express as it claimed to be in one of the headers.

_______________________________________________
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/ja.py%40farowl.co.uk

