On Tue, Dec 15, 2015 at 2:27 PM, Tim Roberts <t...@probo.com> wrote:
>
> The Windows console shell is an 8-bit entity.  That means you only have
> 256 characters available at any given time, similar to the way
> non-Unicode strings work in Python 2.

The input and screen buffers of the console (conhost.exe) are UCS-2,
which has been the case since NT 3.1 was released in 1993. There are
display limits, such as not being able to mix narrow and wide glyphs
and not handling characters composed with multiple codes (such as
UTF-16 surrogate pairs). Regardless of what's displayed, the
wide-character API preserves the underlying UTF-16 text.
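
To make the point about "multiple codes" concrete, here is a small Python 3 sketch (my own illustration, not anything from the console sources) that lists the 16-bit UTF-16 code units for a few characters. The helper name utf16_units is made up for this example. A character outside the BMP needs a surrogate pair, i.e. two codes.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]

# 'A' and U+00C0 fit in one code unit; U+1F600 needs a surrogate pair.
for ch in (u'A', u'\xc0', u'\U0001F600'):
    units = ' '.join('%04X' % u for u in utf16_units(ch))
    print('U+%04X -> %s' % (ord(ch), units))
```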

That said, handling the input buffer requires special care due to how
it represents characters that aren't mapped by the current keyboard
layout. When a file is dropped on the console window, the WindowProc
of conhost.exe handles the WM_DROPFILES [1] message as if the path
were pasted from the clipboard. It loops over the string to create an
INPUT_RECORD [2] array. Each character is mapped in the current
keyboard layout via VkKeyScan [3].
If this fails, the console uses a sequence of Alt+Numpad key event
records. (At the end of this reply I'm including a commented
transcript of a session with a debugger attached to conhost.exe in
Windows 10. I set a breakpoint on s_DoStringPaste to watch how it
handled pasting "À" into the input buffer.)
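
As a rough model of that fallback, the following Python 3 sketch builds the same shape of event list that shows up in the transcript at the end of this reply: Alt pressed, a press/release pair per decimal digit of the OEM code, then Alt released carrying the real character. This is only a simulation. The names closest_oem_char and alt_numpad_events are hypothetical, and the NFKD-based "closest character" step is my assumption standing in for whatever best-fit mapping the console really uses.

```python
import unicodedata

VK_MENU = 0x12      # Alt key
VK_NUMPAD0 = 0x60   # VK_NUMPAD0..VK_NUMPAD9 are 0x60..0x69

def closest_oem_char(ch, codepage='cp437'):
    """Hypothetical stand-in for the console's OEM conversion."""
    # Assumption: try the character, then its base character after
    # decomposition, and finally fall back to '?'.
    for candidate in (ch, unicodedata.normalize('NFKD', ch)[0], u'?'):
        try:
            candidate.encode(codepage)
            return candidate
        except UnicodeEncodeError:
            continue

def alt_numpad_events(ch, codepage='cp437'):
    """Return (key_down, virtual_key, unicode_char) tuples for ch."""
    oem = closest_oem_char(ch, codepage).encode(codepage)[0]
    events = [(True, VK_MENU, u'\0')]
    for digit in str(oem):                 # e.g. 65 -> '6', '5'
        vk = VK_NUMPAD0 + int(digit)
        events.append((True, vk, u'\0'))   # numpad key pressed
        events.append((False, vk, u'\0'))  # numpad key released
    # The actual Unicode character rides on the Alt release event.
    events.append((False, VK_MENU, ch))
    return events

for key_down, vk, uchar in alt_numpad_events(u'\xc0'):
    print('down' if key_down else 'up  ', hex(vk), hex(ord(uchar)))
```

I've only compared this against the single "À" case from the debug session. I make no claim that it matches the console for other characters or code pages.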

A client program that calls ReadConsoleW [4] doesn't have to worry
about this. The console internally handles decoding the Alt+Numpad
sequence when it writes the input to the caller's wide-character
buffer. Microsoft's getwch function instead calls ReadConsoleInputW
[5] to be able to read extended keys and avoid discarding non-keyboard
events, but it doesn't handle the Alt+Numpad case. Handling these
sequences requires a custom implementation of kbhit and getwch.
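
The core of such a custom implementation might look like the sketch below. It works on plain (key_down, virtual_key, unicode_char) tuples rather than real INPUT_RECORDs, so it can be tried without a console. The function name chars_from_key_events is made up, and extended keys (arrows, function keys) are deliberately ignored here. The rule is the one described above: accept a character on a key press, and also on the Alt (VK_MENU) release that ends an Alt+Numpad sequence.

```python
VK_MENU = 0x12  # Alt key

def chars_from_key_events(events):
    """Yield characters from (key_down, virtual_key, unicode_char) tuples."""
    for key_down, vk, uchar in events:
        if uchar in (u'', u'\0'):
            # No character: a modifier, or a numpad digit typed while
            # Alt is held down.
            continue
        if key_down or vk == VK_MENU:
            # Either a normal key press, or the Alt release that carries
            # the character of an Alt+Numpad sequence.
            yield uchar

events = [
    (True, 0x41, u'a'), (False, 0x41, u'a'),    # ordinary 'a' key
    (True, VK_MENU, u'\0'),                     # Alt pressed
    (True, 0x66, u'\0'), (False, 0x66, u'\0'),  # numpad 6
    (True, 0x65, u'\0'), (False, 0x65, u'\0'),  # numpad 5
    (False, VK_MENU, u'\xc0'),                  # Alt released, U+00C0
]
print(ascii(u''.join(chars_from_key_events(events))))
```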

An example that gets this right is the PDCurses [6] library, when
compiled using the wide-character API. Christoph Gohlke has a Python
curses module [7] for Windows that uses PDCurses, but only the Python
3 version is compiled with Unicode support.

If extended key support (e.g. arrow and function keys) and preserving
mouse, window buffer, and focus events don't matter, then just
disable the console's line input and echo mode, and call ReadConsoleW
to read a character at a time. This lets the console handle the
Alt+Numpad events for you. Here's example ctypes code for this limited
implementation of kbhit and getwch. It's not broadly tested, so caveat
emptor. I did check that it worked with file drops and pasting Unicode
strings into the console, as well as manual Alt+Numpad input.

    import msvcrt
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    STD_INPUT_HANDLE = -10
    KEY_EVENT = 1
    VK_MENU = 0x12
    ENABLE_LINE_INPUT = 2
    ENABLE_ECHO_INPUT = 4

    wintypes.CHAR = ctypes.c_char

    class INPUT_RECORD(ctypes.Structure):
        class EVENT_RECORD(ctypes.Union):
            class KEY_EVENT_RECORD(ctypes.Structure):
                class UCHAR(ctypes.Union):
                    _fields_ = (('UnicodeChar', wintypes.WCHAR),
                                ('AsciiChar',   wintypes.CHAR))
                _fields_ = (('bKeyDown',          wintypes.BOOL),
                            ('wRepeatCount',      wintypes.WORD),
                            ('wVirtualKeyCode',   wintypes.WORD),
                            ('wVirtualScanCode',  wintypes.WORD),
                            ('uChar',             UCHAR),
                            ('dwControlKeyState', wintypes.DWORD))
            _fields_ = (('KeyEvent', KEY_EVENT_RECORD),)
        _fields_ = (('EventType', wintypes.WORD),
                    ('Event',     EVENT_RECORD))

    def kbhit():
        handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
        npend = wintypes.DWORD()
        npeek = wintypes.DWORD()
        if (not kernel32.GetNumberOfConsoleInputEvents(
                    handle, ctypes.byref(npend)) or
            npend.value == 0):
            return False
        inbuf = (INPUT_RECORD * npend.value)()
        if (not kernel32.PeekConsoleInputW(
                    handle, inbuf, npend, ctypes.byref(npeek)) or
            npeek.value == 0):
            return False
        peek = (INPUT_RECORD * npeek.value).from_buffer(inbuf)
        for p in peek:
            if p.EventType != KEY_EVENT:
                continue
            e = p.Event.KeyEvent
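            # A key press counts, and so does the Alt (VK_MENU) release
            # that carries the character of an Alt+Numpad sequence.
            if (e.bKeyDown or (e.wVirtualKeyCode == VK_MENU and
                               e.uChar.UnicodeChar)):
                return True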
        return False

    def getwch():
        handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
        old_mode = wintypes.DWORD()
        if not kernel32.GetConsoleMode(handle, ctypes.byref(old_mode)):
            raise ctypes.WinError(ctypes.get_last_error())
        mode = old_mode.value & ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT)
        kernel32.SetConsoleMode(handle, mode)
        try:
            wc = wintypes.WCHAR()
            n = wintypes.DWORD()
            if not kernel32.ReadConsoleW(
                    handle, ctypes.byref(wc), 1,
                    ctypes.byref(n), None):
                raise ctypes.WinError(ctypes.get_last_error())
            return wc.value
        finally:
            kernel32.SetConsoleMode(handle, old_mode)


> Using msvcrt.getwch does not convert the console to a Unicode entity.
> It merely means the characters you DO get are represented in Unicode.

FYI, the CRT source code is distributed with Visual Studio. For
example, with Windows 10 and Visual Studio 2015, it should be
installed here:

    _getch, _kbhit
    %ProgramFiles(x86)%\Windows Kits\10\Source\10.0.10150.0\ucrt\conio\getch.cpp

    _getwch
    %ProgramFiles(x86)%\Windows Kits\10\Source\10.0.10150.0\ucrt\conio\getwch.cpp

So there's no mystery about what these functions do. The mystery that
requires digging into the debugger is how conhost.exe implements the
public console API. Thankfully Microsoft's symbol server publishes the
(public) conhost symbols, so it's relatively easy to find interesting
functions to break on.

> The Windows console theoretically supports a UTF-8 code page (chcp
> 65001), and it does fix many of these problems, but there are some
> console apps that won't like it.

The console itself doesn't support codepage 65001 (UTF-8) well at all.
Depending on the version of Windows, conhost.exe (or csrss.exe prior
to Win 7) has several bugs and shortcomings with this codepage. For
example:

    * For reading from the console, all versions I've used
      fail to correctly encode non-ASCII characters as UTF-8
      via WideCharToMultiByte. If you request 10 bytes, it
      attempts to encode 10 characters, which fails for
      non-ASCII UTF-8. Instead of trying to dynamically step
      down the number of characters, it returns to the
      client that it 'successfully' read 0 bytes. This
      generally gets interpreted as EOF. For example,
      Python's REPL quits, and input() raises EOFError.

    * A buffered writer might flush and split a 2-4 byte
      UTF-8 sequence into two separate writes. But the
      console doesn't maintain the state of partially
      written characters (or reads if the above bug wasn't
      there). Instead you'll end up with 2-4 U+FFFD
      replacement characters written to the console.

    * Prior to Windows 8, WriteFile to the console incorrectly
      reports the number of Unicode characters written instead
      of the number of bytes. So buffered writers will loop
      repeatedly writing what they think is the remainder of
      the UTF-8 buffer. This causes a potentially long trail
      of junk text to be printed after every buffered write
      that contains non-ASCII characters.
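
The last two problems are easy to model in pure Python 3 without a console. This is only a simulation under my own assumptions. Python's 'replace' error handler stands in for the console's decoder, and make_miscounting_write is a hypothetical model of a write call that reports characters instead of bytes. None of this is taken from the conhost code.

```python
import codecs

def stateless_decode(chunks):
    # Each write is decoded on its own, with no memory of a partial
    # sequence left over from the previous write.
    return u''.join(c.decode('utf-8', 'replace') for c in chunks)

def stateful_decode(chunks):
    # An incremental decoder carries the partial sequence forward.
    dec = codecs.getincrementaldecoder('utf-8')('replace')
    text = u''.join(dec.decode(c) for c in chunks)
    return text + dec.decode(b'', final=True)

def write_all(data, write):
    # Typical buffered-writer loop: trust the count that write reports.
    while data:
        data = data[write(data):]

def make_miscounting_write(screen):
    # Hypothetical model: report characters written instead of bytes.
    def write(data):
        text = data.decode('utf-8', 'replace')
        screen.append(text)
        return len(text)
    return write

euro = u'\u20ac'.encode('utf-8')        # 3 bytes: e2 82 ac
split = [euro[:1], euro[1:]]            # flushed in mid-character
print(ascii(stateless_decode(split)))   # only replacement characters
print(ascii(stateful_decode(split)))    # the intact character

screen = []
write_all(u'\xc0b'.encode('utf-8'), make_miscounting_write(screen))
print(ascii(u''.join(screen)))          # the text plus a junk tail
```

In the second half, the writer sends 3 bytes, is told 2 were written, and so sends the last byte again. The longer the non-ASCII text, the longer the repeated tail.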

As mentioned above, here's the debug session with a breakpoint set on
ConhostV2!Clipboard::s_DoStringPaste. (conhostv2.dll was added in
Windows 10, as part of the update of the console interface. It seems
they're modularizing and modernizing the design using C++ classes,
perhaps to accommodate more improvements in future releases?) To
follow this it helps to have a basic understanding of Microsoft's
debugger commands [8] and x64 register usage [9].

    Breakpoint 0 hit
    ConhostV2!Clipboard::s_DoStringPaste:
    00007ffb`11086120 4885c9          test    rcx,rcx
    0:001> pc

Allocate memory for the INPUT_RECORD array:

    ConhostV2!Clipboard::s_DoStringPaste+0x51:
    00007ffb`11086171 ff1501470200    call    qword ptr [
                                        ConhostV2!_imp_RtlAllocateHeap
                                        (00007ffb`110aa878)]
                                        ds:00007ffb`110aa878={
                                        ntdll!RtlAllocateHeap
                                        (00007ffb`1c6aebf0)}

VkKeyScanW returns -1 (0xffff) because the character isn't mapped in
the keyboard layout:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x122:
    00007ffb`11086242 ff1520420200    call    qword ptr [
                                        ConhostV2!_imp_VkKeyScanW
                                        (00007ffb`110aa468)]
                                        ds:00007ffb`110aa468={
                                        USER32!VkKeyScanW
                                        (00007ffb`1a6f6dc0)}
    0:001> p; r rax
    rax=ffffffffffffffff

So convert the character to the closest OEM character to create an
Alt+Numpad sequence. Note that the OEM character is just for the
sequence. The actual Unicode character is stored in the Alt key
(VK_MENU) release event.

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x1a9:
    00007ffb`110862c9 e8cecd0000      call    ConhostV2!ConvertToOem
                                              (00007ffb`1109309c)
    0:001> ? @rcx
    Evaluate expression: 437 = 00000000`000001b5
    0:001> du @rdx l1
    000000d3`1935f850  "À"
    0:001> r r9
    r9=000000d31935f840

The closest character to L'À' in codepage 437 is ASCII 'A':

    0:001> p; da d31935f840 l1
    000000d3`1935f840  "A"

Call _itoa_s to get the base 10 representation of the ordinal value of
'A' as the string "65":

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x1c0:
    00007ffb`110862e0 ff15d2440200    call    qword ptr [
                                        ConhostV2!_imp__itoa_s
                                        (00007ffb`110aa7b8)]
                                        ds:00007ffb`110aa7b8={
                                        msvcrt!itoa_s
                                        (00007ffb`1c042af0)}
    0:001> ?? (char)@rcx
    char 0n65 'A'
    0:001> r rdx
    rdx=000000d31935f7c4
    0:001> p; da d31935f7c4
    000000d3`1935f7c4  "65"

Create events for entering 6 and 5 on the numeric keypad. The
corresponding wVirtualKeyCode values are VK_NUMPAD6 and VK_NUMPAD5.
Also get the keyboard scan codes by calling MapVirtualKeyW.

Call MapVirtualKeyW to get the wVirtualScanCode for VK_NUMPAD6:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x21c:
    00007ffb`1108633c ff151e410200    call    qword ptr [
                                        ConhostV2!_imp_MapVirtualKeyW
                                        (00007ffb`110aa460)]
                                        ds:00007ffb`110aa460={
                                        USER32!MapVirtualKeyW
                                        (00007ffb`1a6f3e00)}
    0:001> r rcx
    rcx=0000000000000066

Call MapVirtualKeyW to get the wVirtualScanCode for VK_NUMPAD5:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x21c:
    00007ffb`1108633c ff151e410200    call    qword ptr [
                                        ConhostV2!_imp_MapVirtualKeyW
                                        (00007ffb`110aa460)]
                                        ds:00007ffb`110aa460={
                                        USER32!MapVirtualKeyW
                                        (00007ffb`1a6f3e00)}
    0:001> r rcx
    rcx=0000000000000065
    0:001> pc

Write the INPUT_RECORD array to the input buffer.

    ConhostV2!Clipboard::s_DoStringPaste+0x3e8:
    00007ffb`11086508 e87b66ffff      call    ConhostV2!WriteInputBuffer
                                              (00007ffb`1107cb88)

This writes an array with 6 records:

    0:001> r r8
    r8=0000000000000006

Each record is 20 (0x14) bytes.

VK_MENU (0x12) pressed:

    0:001> dw (@rdx + 0*14) la
    000000d3`16dc9010  0001 16df 0001 0000 0001 0012 0038 0000
    000000d3`16dc9020  0002 0000

VK_NUMPAD6 (0x66) pressed:

    0:001> dw (@rdx + 1*14) la
    000000d3`16dc9024  0001 0000 0001 0000 0001 0066 004d 0000
    000000d3`16dc9034  0002 0000

VK_NUMPAD6 (0x66) released:

    0:001> dw (@rdx + 2*14) la
    000000d3`16dc9038  0001 3474 0000 0000 0001 0066 004d 0000
    000000d3`16dc9048  0002 0000

VK_NUMPAD5 (0x65) pressed:

    0:001> dw (@rdx + 3*14) la
    000000d3`16dc904c  0001 0000 0001 0000 0001 0065 004c 0000
    000000d3`16dc905c  0002 0000

VK_NUMPAD5 (0x65) released:

    0:001> dw (@rdx + 4*14) la
    000000d3`16dc9060  0001 0000 0000 0000 0001 0065 004c 0000
    000000d3`16dc9070  0002 0000

VK_MENU (0x12) released; UnicodeChar == U+00C0:

    0:001> dw (@rdx + 5*14) la
    000000d3`16dc9074  0001 0000 0000 0000 0001 0012 0038 00c0
    000000d3`16dc9084  0000 0000
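
If reading the raw words is tedious, they can be unpacked with the struct module. The sketch below (Python 3) assumes the 20-byte layout implied by the ctypes definition earlier in this reply: a WORD EventType, 2 bytes of padding, then the KEY_EVENT_RECORD fields. The function name parse_dw is made up. It takes the ten hex words of one record exactly as the dw command printed them.

```python
import struct

# EventType, 2 pad bytes, bKeyDown, wRepeatCount, wVirtualKeyCode,
# wVirtualScanCode, UnicodeChar, dwControlKeyState (little-endian).
RECORD = struct.Struct('<H2xiHHHHI')   # 20 (0x14) bytes

def parse_dw(words):
    """Parse the ten hex words of one key event record."""
    raw = struct.pack('<10H', *[int(w, 16) for w in words.split()])
    event_type, key_down, repeat, vk, scan, uchar, ctrl = RECORD.unpack(raw)
    return {'EventType': event_type,
            'bKeyDown': bool(key_down),
            'wRepeatCount': repeat,
            'wVirtualKeyCode': vk,
            'wVirtualScanCode': scan,
            'UnicodeChar': chr(uchar),
            'dwControlKeyState': ctrl}

first = parse_dw('0001 16df 0001 0000 0001 0012 0038 0000 0002 0000')
last = parse_dw('0001 0000 0000 0000 0001 0012 0038 00c0 0000 0000')
for rec in (first, last):
    print(rec['bKeyDown'], hex(rec['wVirtualKeyCode']),
          hex(ord(rec['UnicodeChar'])), hex(rec['dwControlKeyState']))
```

The 16df in the first record falls in the padding after EventType, so it is just leftover heap contents. The control key state 0x0002 on the key events before the final release is LEFT_ALT_PRESSED.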

[1]: https://msdn.microsoft.com/en-us/library/bb774303
[2]: https://msdn.microsoft.com/en-us/library/ms683499
[3]: https://msdn.microsoft.com/en-us/library/ms646329
[4]: https://msdn.microsoft.com/en-us/library/ms684958
[5]: https://msdn.microsoft.com/en-us/library/ms684961
[6]: http://pdcurses.sourceforge.net
[7]: http://www.lfd.uci.edu/~gohlke/pythonlibs/#curses
[8]: https://msdn.microsoft.com/en-us/library/ff540507
[9]: https://msdn.microsoft.com/en-us/library/9z1stfyw
