On 05Sep2016 1234, eryk sun wrote:
Also, the console is UCS-2, which can't be transcoded between UTF-16
and UTF-8. Supporting UCS-2 in the console would integrate nicely with
the filesystem PEP. It makes it always possible to print
os.listdir('.'), copy and paste, and read it back without data loss.

Supporting UTF-8 actually works better for this. We already use surrogatepass explicitly (on the filesystem side, with PEP 529) and implicitly (on the console side, using the Windows conversion API).

It would probably be simpler to use UTF-16 in the main pipeline and
implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16
buffer could be renamed as "wbuffer", for expert use. However, if
you're fully committed to transcoding in the raw layer, I'm certain
that these problems can be addressed with small buffers and using
Python's codec machinery for a flexible mix of "surrogatepass" and
"replace" error handling.

I don't think it actually makes things simpler. Having two buffers is generally a bad idea unless they are perfectly synced, which would be impossible here without data corruption (if you read half a utf-8 character sequence and then read the wide buffer, do you get that character or not?).

Writing a partial character is easily avoidable by the user. We can either fail with an error or print garbage, and currently printing garbage is the most compatible behaviour. (Also occurs on Linux - I have a VM running this week for testing this stuff.)

Cheers,
Steve
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to