Eryk Sun <[email protected]> added the comment:
I think this is a locale configuration problem, in which the locale encoding
doesn't match the terminal encoding. If so, it can be closed as not a bug.
> export a="中文"
In POSIX, the shell reads "中文" from the terminal as bytes encoded in the
terminal encoding, which could be UTF-8 or some legacy encoding. The value of
`a` is set directly as this encoded text. There is no intermediate
decode/encode stage in the shell. For a child process that decodes the value of
the environment variable, as Python does, the locale's LC_CTYPE encoding should
be the same or compatible with the terminal encoding.
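As a rough illustration of the match requirement, the following sketch simulates the two cases without involving a real shell. The helper name decode_env and the choice of GBK as the legacy terminal encoding are assumptions made for the example; they don't come from the report.

```python
text = "\u4e2d\u6587"  # the two characters from the report

# The bytes stored in the environment depend only on the terminal encoding.
as_utf8 = text.encode("utf-8")
as_gbk = text.encode("gbk")  # assumed legacy terminal encoding


def decode_env(raw: bytes, locale_encoding: str) -> str:
    """Hypothetical helper: decode an environment value the way Python
    does in POSIX, using the locale encoding and "surrogateescape"."""
    return raw.decode(locale_encoding, errors="surrogateescape")


# Terminal and locale agree: the original text comes back.
match = decode_env(as_utf8, "utf-8")

# Terminal is GBK but the locale is UTF-8: the undecodable bytes are
# carried through as lone surrogates instead of the intended characters.
mismatch = decode_env(as_gbk, "utf-8")
```

Here match equals the original text, while mismatch contains only lone surrogates.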
> job_name = os.environ['a']
> print(job_name)
In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if
the default LC_CTYPE locale is a legacy locale -- which in 3.6 is the case for
the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is
also the error handler used to decode the bytes in os.environb (POSIX) as the
text in os.environ. When decoding, "surrogateescape" handles each byte that
can't be decoded by translating it to a code in the reserved surrogate range
U+DC80 - U+DCFF. When encoding, it translates each such surrogate code back to
the original byte value in the range 0x80 - 0xFF.
Given the above setup, byte sequences in os.environb that can't be decoded with
the default LC_CTYPE locale encoding will be surrogate escaped in the decoded
text. The surrogate-escaped values roundtrip back to the original bytes when
printed, presumably in the terminal encoding.
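The roundtrip can be sketched as follows. This is a minimal example that assumes a UTF-8 terminal and an ASCII locale encoding; it hard-codes the bytes instead of reading os.environb.

```python
# UTF-8 bytes for the two characters in the report, as a UTF-8 terminal
# would hand them to the shell.
raw = b"\xe4\xb8\xad\xe6\x96\x87"

# Decoding as ASCII with "surrogateescape" maps each undecodable byte
# 0x80-0xFF to a lone surrogate in U+DC80-U+DCFF.
escaped = raw.decode("ascii", errors="surrogateescape")
code_points = [hex(ord(ch)) for ch in escaped]

# Encoding with the same handler restores the original bytes, which is
# why print() can appear to work even though the text is not valid.
restored = escaped.encode("ascii", errors="surrogateescape")
```

Each of the six input bytes becomes one surrogate code, and restored is equal to raw.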
> with open('name.txt', 'w', encoding='utf-8') as fw:
>     fw.write(job_name)
The default error handler for open() is "strict" instead of "surrogateescape",
so encoding the surrogate-escaped values in job_name fails with
UnicodeEncodeError.
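A sketch of the failure and one possible workaround follows. It assumes the environment bytes were really UTF-8 and were decoded under an ASCII locale; the job_name value is constructed by hand to stand in for os.environ['a'], and the file is written to a temporary directory rather than the current one.

```python
import os
import tempfile

# Stand-in for os.environ['a']: UTF-8 bytes decoded under an ASCII
# locale with "surrogateescape" (an assumption for this example).
job_name = b"\xe4\xb8\xad\xe6\x96\x87".decode("ascii", errors="surrogateescape")

path = os.path.join(tempfile.mkdtemp(), "name.txt")

# The default errors="strict" rejects the lone surrogates.
try:
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(job_name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Workaround: recover the original bytes, then decode them with the
# encoding the terminal actually used (assumed to be UTF-8 here).
fixed = job_name.encode("ascii", errors="surrogateescape").decode("utf-8")
with open(path, "w", encoding="utf-8") as fw:
    fw.write(fixed)

with open(path, encoding="utf-8") as fr:
    written = fr.read()
```

In a real program, the original bytes are also available directly from os.environb in POSIX. Configuring the locale to match the terminal avoids the problem without any of this.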
> Your code runs for me on Windows
In Windows, Python uses the wide-character (16-bit wchar_t) environment of the
process for os.environ, and, in 3.6+, it uses the console session's
wide-character API for console files such as sys.std* when they aren't
redirected to a pipe or disk file. Conventionally, wide-character strings
should be valid UTF-16LE text. So getting "中文" from os.environ and printing it
should 'just work'. The output will even be displayed correctly if the console
session uses a font that supports "中文", or if it's a pseudoconsole (conpty)
session that's attached to a terminal that supports automatic font fallback,
such as Windows Terminal.
----------
components: +IO, Interpreter Core, Library (Lib), Unicode -C API
nosy: +eryksun, ezio.melotti, vstinner
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43576>
_______________________________________