On 05Sep2016 1110, Paul Moore wrote:
On 5 September 2016 at 18:38, Steve Dower <steve.do...@python.org> wrote:
Can you provide an example of how I'd rewrite the code that I quoted
previously to follow this advice? Note - this is not theoretical, I
expect to have to provide a PR to fix exactly this code should this
change go in. At the moment I can't find a way that doesn't impact the
(currently working and not expected to need any change) Unix version
of the code, most likely I'll have to add buffering of 4-byte reads
(which as you say is complex).

The easiest way to follow it is to use "sys.stdin.buffer.read(1)" rather
than "sys.stdin.buffer.raw.read(1)".

I may have got confused here. If I say sys.stdin.buffer.read(1),
having first checked via kbhit() that there's a character
available[1], then I will always get 1 byte returned, never the
"buffer too small to return a full character" error that you talk
about in the PEP? If so, then I don't understand when the error you
propose will be raised (unless your comment here is based on what you
say below that we'll now buffer and therefore the error is no longer
needed).
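For concreteness, the pattern I have in mind is roughly this (just a
sketch; msvcrt.kbhit() is the Windows-only keypress check, and whether
the read can still block here is exactly what I'm unsure about):

    import sys
    import msvcrt

    def poll_one_byte():
        """Return one byte of console input if a key is waiting, else None."""
        if msvcrt.kbhit():                    # a keypress is pending
            return sys.stdin.buffer.read(1)   # expect exactly one byte back
        return None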

I don't think using buffer.read and kbhit together is going to be reliable anyway, as you may not have read everything that's already buffered yet. It's likely feasible if you flush everything, but otherwise it's a bit messy.

One thing I did think of, though - if someone *is* working at the raw
IO level, they have to be prepared for the new "buffer too small to
return a full character" error. That's OK. But what if they request
reading 7 bytes, but the input consists of 6 characters that encode
to 1 byte in UTF-8, followed by a character that encodes to 2 bytes?
You can return 6 bytes, that's fine - but you'll presumably still need
to read the extra character before you can determine that it won't fit
- so you're still going to have to buffer to some degree, surely? I
guess this is an implementation detail, though - I'll try to find some
time to read the patch in order to understand this. It's not something
that matters in terms of the PEP anyway; it's an implementation
detail.
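Just to make the arithmetic in that example concrete (illustrative
only):

    # Six ASCII characters are one UTF-8 byte each; '\u00e9' ("e acute") is two.
    text = 'abcdef' + '\u00e9'
    encoded = text.encode('utf-8')
    assert len(encoded) == 8     # 6 * 1 + 2 bytes
    # So a 7-byte read can only hold the first six characters, and the
    # two-byte character has to be held back somewhere for a later read.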

If you do raw.read(7), we internally do "7 / 4" (integer division) and decide to only read one wchar_t from the console. So the returned data will be between 1 and 4 bytes long, and there will be more input waiting for the next time you ask.
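Roughly, the intent is (illustrative arithmetic only, not the actual C
code):

    n = 7
    wchars_to_read = n // 4      # 7 // 4 == 1, so read one wchar_t
    # That character encodes to between 1 and 4 UTF-8 bytes, so
    # raw.read(7) hands back 1-4 bytes and leaves the rest of the
    # input in the console for the next call.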

The only case we can reasonably handle at the raw layer is when "n / 4" is zero but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t) and then return those in future calls. If we try to cache any more than that, we're substituting for the buffered reader, which I don't want to do.

Does caching up to one (Unicode) character at a time sound reasonable? I think that won't be much trouble, since there's no interference between system calls in that case and it will be consistent with POSIX behaviour.
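To make that concrete, here is a minimal sketch of the idea in Python
(the real code is C against the console APIs; read_chars below is just
an injected stand-in, and none of these names match the actual
implementation):

    class RawConsoleCacheSketch:
        """Illustrative only: raw reader that caches at most one character."""

        def __init__(self, read_chars):
            # read_chars(count) must return `count` characters of console
            # input; here it is just a stand-in for the real console read.
            self._read_chars = read_chars
            self._pending = b''   # leftover bytes of a partially-returned char

        def read(self, n):
            if n <= 0:
                return b''
            # Serve bytes cached by a previous short read first.
            if self._pending:
                data, self._pending = self._pending[:n], self._pending[n:]
                return data
            # Normally read n // 4 characters; when n < 4, still read one
            # character and cache whatever does not fit in the caller's buffer.
            chars = self._read_chars(max(1, n // 4))
            encoded = chars.encode('utf-8')
            data, self._pending = encoded[:n], encoded[n:]
            return data

    # e.g. with input '\u00e9' (2 bytes in UTF-8) and read(1):
    reader = RawConsoleCacheSketch(lambda count: '\u00e9' * count)
    assert reader.read(1) == b'\xc3'   # first byte now...
    assert reader.read(1) == b'\xa9'   # ...cached byte on the next call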

Cheers,
Steve
