Hi all,

For Qt6, we want to close the remaining holes in our Unicode story. One of the 
changes we already decided on at the Contributor Summit is to enforce UTF-8 
based locales on Unix systems (they’ve been the default there for the last 
10-15 years).

I would really like to make this 100% cross-platform, so I sat down and did a 
bit of research on Windows. Things are looking pretty promising, at least since 
Windows 10, build 1903.

Windows actually has two 8-bit code pages that you need to take care of when 
writing an application. There’s the application code page, which can be 
retrieved with the GetACP() function (1). Until build 1903 the application code 
page could only be changed for the system as a whole, but it can now also be 
changed on a per-application basis using a manifest file (2). 

The console has a separate encoding that can be changed programmatically from 
within the app (3). The two code pages can differ from each other, which makes 
things interesting.
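
To see which code pages are in play on a given machine, something along these 
lines can be used. It is only a sketch: the CodePages struct and the helper 
names (queryCodePages, describeCodePage, codePagesAgree) are made up for 
illustration and are not Qt API. Only GetACP(), GetConsoleCP() and 
GetConsoleOutputCP() are real Win32 calls. On non-Windows builds the query 
simply returns zeros.

```cpp
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

// The code pages a Windows process has to care about (hypothetical helper type).
struct CodePages {
    unsigned ansi = 0;          // application ("ANSI") code page, GetACP()
    unsigned consoleInput = 0;  // console input code page, GetConsoleCP()
    unsigned consoleOutput = 0; // console output code page, GetConsoleOutputCP()
};

// Query the current values; all fields stay 0 when not built for Windows.
CodePages queryCodePages()
{
    CodePages cp;
#ifdef _WIN32
    cp.ansi = GetACP();
    cp.consoleInput = GetConsoleCP();
    cp.consoleOutput = GetConsoleOutputCP();
#endif
    return cp;
}

// Human-readable name for the few code pages mentioned in this mail.
std::string describeCodePage(unsigned cp)
{
    switch (cp) {
    case 850:   return "CP850";
    case 1252:  return "Windows-1252";
    case 65001: return "UTF-8";
    default:    return "code page " + std::to_string(cp);
    }
}

// True only if 8-bit data encoded for the application code page can be
// written to the console unchanged.
bool codePagesAgree(const CodePages &cp)
{
    return cp.ansi == cp.consoleOutput && cp.ansi == cp.consoleInput;
}
```

With the values from my machine (ACP 1252, console 850), codePagesAgree() 
returns false, which is exactly the situation where toLocal8Bit() output ends 
up garbled on the console.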

I’ve been running a couple of tests using a simple test program (4) and a 
manifest file to change the ACP to utf8 as described in (2). Here are the 
results:

* On my machine, the default output code page for the console is CP850. My 
default ansi codepage is Windows-1252. So they don't agree, and writing to the 
console using toLocal8Bit() can/will lead to mojibake in some cases. As such 
our current handling is already broken, as we always convert to local 8-bit 
using the application code page, also for stdin/out/err.
* Setting the application's code page to utf8 using a manifest works nicely, and 
GetACP() returns 65001 (UTF8) instead of 1252.
* SetConsoleOutputCP(CP_UTF8) also works reliably. Writing utf8 based data to 
stdout, or even using _write(1, data, len), gives the correct output on the 
console. This even works when writing one char at a time (i.e. incomplete utf8 
sequences).
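
To illustrate the mismatch from the first bullet: the same byte means 
different things in Windows-1252 and CP850. The sketch below hard-codes two 
bytes from the published code page tables. The function names are invented for 
this example, and the tables are deliberately tiny excerpts, not complete 
decoders.

```cpp
#include <string>

// Tiny excerpt of Windows-1252: returns the UTF-8 text for one byte.
std::string decodeWindows1252(unsigned char b)
{
    switch (b) {
    case 0xE9: return "\xC3\xA9"; // U+00E9, e with acute
    case 0xFC: return "\xC3\xBC"; // U+00FC, u with diaeresis
    default:   return b < 0x80 ? std::string(1, static_cast<char>(b))
                               : std::string("?");
    }
}

// Tiny excerpt of CP850: the same bytes map to different characters.
std::string decodeCp850(unsigned char b)
{
    switch (b) {
    case 0xE9: return "\xC3\x9A"; // U+00DA, capital U with acute
    case 0xFC: return "\xC2\xB3"; // U+00B3, superscript three
    default:   return b < 0x80 ? std::string(1, static_cast<char>(b))
                               : std::string("?");
    }
}

// Decode a whole byte string with one of the two tables above.
std::string decode(const std::string &bytes,
                   std::string (*table)(unsigned char))
{
    std::string utf8;
    for (char c : bytes)
        utf8 += table(static_cast<unsigned char>(c));
    return utf8;
}
```

The bytes "caf\xE9" are what toLocal8Bit() produces for "café" under 
Windows-1252. Decoded with decodeWindows1252 they round-trip correctly, but a 
console using CP850 reads the last byte as a capital U with acute.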

* With our current handling, none of the test strings show up correctly on the 
console. 
* Setting the output code page makes writing to stdout/stderr with toUtf8() 
work correctly. qDebug() is still not working.
* Setting the manifest in addition will also make qDebug() work correctly.
* QTextStream still delivers mojibake in all cases. I assume there’s a bug 
somewhere in the way we handle things in QTextStream; this needs some 
debugging.

Conclusions:

So to me it looks like you can get a 100% utf8/utf16 setup for your apps, 
compatible with what we do on Unix, on Windows 10 build 1903 or later. To get 
this fully working, Qt would need to call SetConsoleOutputCP(CP_UTF8) and 
SetConsoleCP(CP_UTF8), and the build system would need to add a manifest file 
that sets the application code page to utf8.
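
A minimal sketch of what the console part could look like, assuming we want to 
restore the previous code pages on exit the way the test program (4) does. 
Utf8ConsoleGuard is a hypothetical name and not an existing Qt class. Whether 
Qt should restore the old values at all is an open question. On non-Windows 
builds the guard does nothing.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// Switches both console code pages to UTF-8 for the lifetime of the object
// and restores the previous values on destruction.
class Utf8ConsoleGuard
{
public:
    Utf8ConsoleGuard()
    {
#ifdef _WIN32
        m_oldOutput = GetConsoleOutputCP();
        m_oldInput = GetConsoleCP();
        // Both calls fail (return 0) if the process has no console attached.
        const bool outputOk = SetConsoleOutputCP(CP_UTF8) != 0;
        const bool inputOk = SetConsoleCP(CP_UTF8) != 0;
        m_changed = outputOk && inputOk;
#endif
    }

    ~Utf8ConsoleGuard()
    {
#ifdef _WIN32
        // Only restore values that were actually read successfully.
        if (m_oldOutput != 0)
            SetConsoleOutputCP(m_oldOutput);
        if (m_oldInput != 0)
            SetConsoleCP(m_oldInput);
#endif
    }

    Utf8ConsoleGuard(const Utf8ConsoleGuard &) = delete;
    Utf8ConsoleGuard &operator=(const Utf8ConsoleGuard &) = delete;

    // True if both console code pages were switched to UTF-8.
    bool changed() const { return m_changed; }

private:
    unsigned m_oldOutput = 0;
    unsigned m_oldInput = 0;
    bool m_changed = false;
};
```

An application would create one Utf8ConsoleGuard at the top of main(). The 
manifest part cannot be done at runtime, so it would still have to come from 
the build system.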

As the code page for the console is not compatible with the ansi code page, I 
don't see why we shouldn't change the console code pages in any case. In 
addition, I think we should add the manifest file to the app through the build 
system by default (and offer a switch to turn both the console code page and 
manifest handling off).

I think this would be mostly a positive change and won't break too many things 
for our users. Most Qt apps don't make heavy use of the 8bit APIs. If they do, 
they need to be prepared to handle different code pages anyway, so changing to 
utf8 should not break anything for them. Filenames are encoded in utf16 on 
NTFS, so setting the ACP to utf8 would make all files accessible by the 8bit 
APIs. 
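
To make the filename point concrete, here is a sketch of what code has to do 
today to open a file from a UTF-8 path, and what it could do once the ACP is 
UTF-8. openUtf8Path and utf8ToWide are hypothetical helpers, not Qt API. The 
sketch assumes the CRT's narrow fopen() interprets its path argument in the 
application code page, which is my understanding but worth verifying.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

#ifdef _WIN32
// Convert a null-terminated UTF-8 string to UTF-16 for the wide Win32/CRT APIs.
static std::wstring utf8ToWide(const char *utf8)
{
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], len);
    wide.resize(static_cast<std::size_t>(len) - 1); // drop the terminator
    return wide;
}
#endif

// Open a file whose name is given as UTF-8.
std::FILE *openUtf8Path(const char *utf8Path, const char *mode)
{
#ifdef _WIN32
    // With a UTF-8 application code page the 8-bit API can be used directly.
    if (GetACP() == CP_UTF8)
        return std::fopen(utf8Path, mode);
    // Otherwise go through the 16-bit API so no character is lost.
    return _wfopen(utf8ToWide(utf8Path).c_str(), utf8ToWide(mode).c_str());
#else
    // Unix: paths are byte strings, UTF-8 by convention.
    return std::fopen(utf8Path, mode);
#endif
}
```

With the manifest in place the second branch becomes dead code, which is the 
simplification I'm after.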

Other than that I can't really think of many potential issues, as the main 
Windows APIs are usually the 16bit APIs, and the 8bit ones are only wrappers 
around those.

Comments? Am I missing something vital?

Cheers,
Lars

(1) https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
(2) 
https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
(3) https://docs.microsoft.com/en-us/windows/console/setconsoleoutputcp
(4)

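// Assumes this source file is saved as UTF-8 and compiled as such
// (e.g. with /utf-8 on MSVC).
#include <qdebug.h>
#include <QTextStream>
#include <Windows.h>
#include <io.h>
#include <cstdio>

int main(int, char**)
{
    // Remember the console output code page so it can be restored on exit.
    uint cp = GetConsoleOutputCP();
    qDebug() << "Console CP" << cp << GetACP();
    SetConsoleOutputCP(CP_UTF8);
    qDebug() << "Hello" << QString::fromUtf8("Ελληνικά");
    printf("Ελληνικά\n");
    QByteArray greek = "Ελληνικά\n";
    // Pass the data as an argument, never as the format string.
    fprintf(stdout, "%s", greek.constData());
    // One byte at a time, i.e. incomplete UTF-8 sequences per call.
    for (char c : greek)
        fprintf(stdout, "%c", c);
    // Same two tests, bypassing stdio and writing to fd 1 directly.
    _write(1, greek.constData(), static_cast<unsigned int>(greek.length()));
    for (char c : greek)
        _write(1, &c, 1);

    QTextStream ts(stdout);
    ts.setEncoding(QStringConverter::Utf8);
    ts << QString::fromUtf8(greek);
    ts.flush(); // make sure the output is written before the code page is reset
    SetConsoleOutputCP(cp);
    return 0;
}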


