STINNER Victor added the comment:
> The BOM (byte order mark) appears in the standard input stream. When using
> cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as
> CP65001.
How do you change the console encoding? Using the chcp command?
I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you
please check that sys.stdin.encoding is "cp1252"?
I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and
ANSI code page 1252:
- by default, the stdin encoding is cp850 (OEM code page) and
os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a
BOM.
- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes
cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is
the result of locale.getpreferredencoding(False) (ANSI code page).
sys.stdin.readline() does not contain a BOM.
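The selection described above can be inspected with a small helper. This is only a sketch: describe_stdin() is a hypothetical name, not something from Python or from this issue, and it just reports the values compared above (os.device_encoding() of the stdin file descriptor, locale.getpreferredencoding(False) and sys.stdin.encoding).

```python
import locale
import os
import sys


def describe_stdin():
    """Return (device_encoding, preferred_encoding, stdin_encoding).

    describe_stdin is a hypothetical helper name used for illustration.
    """
    try:
        # os.device_encoding() returns None when the descriptor is not
        # connected to a terminal, for example when stdin is a pipe.
        device = os.device_encoding(sys.stdin.fileno())
    except (AttributeError, ValueError, OSError):
        # stdin may be missing or replaced by an object without a real fd.
        device = None
    preferred = locale.getpreferredencoding(False)
    return device, preferred, getattr(sys.stdin, "encoding", None)


if __name__ == "__main__":
    device, preferred, actual = describe_stdin()
    print("os.device_encoding(stdin):", device)
    print("locale.getpreferredencoding(False):", preferred)
    print("sys.stdin.encoding:", actual)
```

Running it once directly in the console and once behind a pipe should show the device encoding switching to None in the piped case.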
If I change the console encoding using the command "chcp 65001":
- by default, the stdin encoding = os.device_encoding(0) = "cp65001".
sys.stdin.readline() does not contain a BOM.
- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) =
"cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM
Note: The UTF-8 BOM is only written once, before the first character.
So the UTF-8 BOM is only written in one case, when all of these conditions hold:
- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe,
even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001
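To make the symptom concrete: if the three UTF-8 BOM bytes are decoded as cp1252, they should show up as three Latin-1 range characters at the start of the first line. The sketch below simulates that with a hand-built byte string (it is not captured PowerShell output), and starts_with_mojibake_bom() is a hypothetical helper name.

```python
import codecs

# The UTF-8 BOM bytes (EF BB BF) as they look once decoded as cp1252.
BOM_AS_CP1252 = codecs.BOM_UTF8.decode("cp1252")


def starts_with_mojibake_bom(line):
    """Return True if a cp1252-decoded line begins with the UTF-8 BOM bytes."""
    return line.startswith(BOM_AS_CP1252)


# Simulated first line of a pipe: BOM + "abc", decoded with the ANSI code page.
simulated = (codecs.BOM_UTF8 + b"abc\r\n").decode("cp1252")

if __name__ == "__main__":
    print(ascii(simulated))
    print(starts_with_mojibake_bom(simulated))
```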
--
It looks like PowerShell decodes the output of the producer program (echo,
type, ...) and then encodes the output to the consumer program (ex: python).
It's possible to change the encoding of the encoder by setting $OutputEncoding
variable. Example to encode to UTF-8 without the BOM:
$OutputEncoding = New-Object System.Text.UTF8Encoding($False)
Example to encode to UTF-8 with the BOM:
$OutputEncoding = [System.Text.Encoding]::UTF8
Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even
if the console encoding is cp850. If you set the console encoding to 65001
(chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get...
two UTF-8 BOMs... yeah!
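On the consumer side, a script that wants to tolerate this could read bytes and drop any leading BOMs before decoding. This is only a possible workaround sketched under the assumption that the input is really UTF-8; strip_leading_boms() is a hypothetical helper name, not an existing API.

```python
import codecs


def strip_leading_boms(data):
    """Remove every UTF-8 BOM found at the start of a bytes object."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    import sys

    # Read raw bytes to bypass the encoding chosen for sys.stdin.
    raw = sys.stdin.buffer.readline()
    print(ascii(strip_leading_boms(raw).decode("utf-8")))
```

Note that the "utf-8-sig" codec only skips a single BOM, so it would not be enough for the double-BOM case.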
I tried different producer programs: [MS-DOS] echo "abc", [PowerShell]
write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content
document.txt, python -c "print('abc')". It doesn't look like using a different
program changes anything. The UTF-8 BOM is added somewhere by PowerShell
between the producer and the consumer programs.
To show the console input and output encodings in PowerShell, type
"[console]::InputEncoding" and "[console]::OutputEncoding".
See also:
http://stackoverflow.com/questions/22349139/utf8-output-from-powershell
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue21927>
_______________________________________