Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move QTextCodec support out of QtCore)

See also:
https://www.python.org/dev/peps/pep-0538/
https://www.python.org/dev/peps/pep-0540/

Summary:
The change above, while removing QTextCodec from our API, had the side-effect of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix systems in Qt 6. This does not apply to Windows because on Windows we cannot reasonably be expected to use UTF-8 for the 8-bit encoding.

There are three questions to be decided:
a) What should Qt 6 assume the locale to be, if no locale is set?
b) In case a non-UTF-8 locale is set, what should we do?
c) Should we propagate our decision to child processes?

My personal preference is:
a) C.UTF-8
b) override it to force UTF-8 on the same locale
c) yes

Long explanation:

On Unix systems, traditionally, the locale is determined by multiple environment variables starting with LC_ (matching macro names from <locale.h>), as well as the LANG and LANGUAGE variables. If none of those is set, the C and POSIX standards say that the default locale is "C". Moreover, POSIX says that the "POSIX" locale is "C" and does not have multibyte encodings -- that excludes its encoding from being UTF-8.

Most modern Unix-based operating systems do set a reasonable, UTF-8-based locale for the user. They've been doing that for about 15 years -- it was in 2003 that this started, when I had to switch from zsh back to bash because zsh didn't support UTF-8 yet, and I switched back in 2005 when it gained support. On top of that, some even more recent Unix offerings -- namely, macOS and Android -- enforce that the default (or only!) locale encoding is UTF-8.

Right now, Qt faithfully accepts the locale configuration set by the user in the environment. It can do that because it has QTextCodec, which is also backed by either the libiconv routines or by ICU, so it can deal with any encoding. In properly-configured environments, there's no problem.

The two Python documents above (PEP-538 and 540) also discuss how Python changed its strategy.
I'm proposing that we follow Python and go a little further.

What's the problem?

The problem is where the locale is not set up properly or it is explicitly overridden. See PEP-538 for examples in containers, but as can be seen from it, Linux will default to "POSIX" or empty, which means Qt will interpret the locale as US-ASCII, which is almost never what is intended. Moreover, because of our use of QString for file names, any name that contains code units above 0x7f will be deemed a filesystem corruption and ignored on directory listing -- such names are not representable.

Furthermore, it happens quite often that users and tools set LC_ALL to "C" in order to obtain messages in English, so they can be parsed by other tools or pasted into emails (every time you see me post an error message from a console, I've done that). There are alternative locales that can be used, like "C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and may not actually be available.

Arguing that this is an incorrect setup, while factually correct, does not change the fact that it happens.

Questions and options:

a) What should Qt 6 assume when the locale is unset or is just "C"?

This is the case of a simple environment where the variables are unset or have some legacy system-wide defaults, as well as when the user explicitly sets LC_ALL to "C".

The options are:
 - accept them as-is
 - assume that C with UTF-8 support was intended

The first option is what we have today. And if this is our option, then neither question (b) nor (c) makes sense.

The second option implies doing the check in QCoreApplication right after setlocale(LC_ALL, ""):

    if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
        setlocale(LC_CTYPE, "C.UTF-8");

b) What should Qt 6 do if a different locale, other than C, is non-UTF8?

This case is not an accident, most of the time. It can happen from time to time that someone is simply testing different languages and forces LC_ALL to something non-default to see what happens.
They'll very quickly try the UTF-8 versions. But when it's not an accident, it means it was intended. This is the general state of Unix prior to 2003, when locales like "en_US", "en_GB", "fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro", "de_DE@euro", "nl_NL@euro", etc.

Options are:
 - accept them as-is (this is what Python does)
 - assume that the UTF-8 variant was intended, just not properly set

The first option is what we have today, aside from the C locale (question (a)). However, keeping that option working implies keeping either ICU or iconv working in Qt 6 and we might want to get rid of that dependency for codecs.

The second option implies modifying the QCoreApplication change above. Instead of explicitly checking for the C locale, we'd use nl_langinfo(CODESET) to find out what codec the locale is expecting. If it's not UTF-8, then we'd compose a new LC_CTYPE locale name from the existing locale name with its codeset replaced by UTF-8. That means we'd transform:

    ""               → "C.UTF-8"
    "C"              → "C.UTF-8"
    "en_US"          → "en_US.UTF-8"
    "fr_FR@euro"     → "fr_FR.UTF-8@euro"
    "zh_CN.GB18030"  → "zh_CN.UTF-8"

c) Should we propagate our decision to child processes?

It's not possible to propagate choices to any other processes, so the question applies only to child ones. Asked differently: should we set our choice in the application environment, so it's inherited by child processes?

Child applications written with Qt 6 would not be affected, aside from maybe a negligible load time improvement. But any other applications, including Qt 5 ones, would not make the same choices. If we do not propagate, we could end up with child processes (often helpers) that make different choices as to what command-line arguments, pipes or file contents mean. Note that we can't affect the *parent* process, so this problem could happen there.

Welcome side-effect: other libraries and the user's own code in the same process can call setlocale() after QCoreApplication has.
It's possible that they, unknowingly, override our choices and change the C library back to an incorrect state. If we do set the environment, this cannot happen.

Another side-effect is that in a Qt-based graphical environment, the "right" choice will be propagated anyway, to all child processes.

Options are:
 - yes (this is what Python does)
 - no

Bonus
d) Should we print a warning when we've made a change?

Options are:
 - yes, for all of them
 - yes, but only for locales other than "C"
 - no

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
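
Illustration for question (b): a rough sketch, in C, of how the locale-name transformation listed under the second option could be written. The helper names utf8_locale_name() and force_utf8_ctype() are hypothetical and are not part of Qt or of the change under review. The sketch assumes locale names of the form language[_territory][.codeset][@modifier], and it treats "POSIX" the same as "C", which goes slightly beyond the table in the text.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the UTF-8 variant of a locale name into out.
 * Returns 0 on success, -1 if out is too small to hold the result.
 */
int utf8_locale_name(const char *name, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    /* Unset, "C" and "POSIX" all map to "C.UTF-8" (assumption, see above). */
    if (name == NULL || *name == '\0'
        || strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0)
        name = "C";

    /* Keep an "@modifier" suffix such as "@euro", if there is one. */
    const char *modifier = strchr(name, '@');

    /* Length of the language[_territory] part, up to the first '.' or '@'. */
    size_t baselen = strcspn(name, ".@");

    int n = snprintf(out, outlen, "%.*s.UTF-8%s",
                     (int)baselen, name, modifier ? modifier : "");
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Hypothetical sketch of the check described for QCoreApplication.
 * Returns 1 if LC_CTYPE was switched to a UTF-8 locale, 0 if the locale
 * was already UTF-8, and -1 if the UTF-8 variant could not be selected
 * (for example because it is not installed on the system).
 */
int force_utf8_ctype(void)
{
    setlocale(LC_ALL, "");
    if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
        return 0;

    char buf[128];
    const char *current = setlocale(LC_CTYPE, NULL);
    if (utf8_locale_name(current, buf, sizeof buf) != 0)
        return -1;
    return setlocale(LC_CTYPE, buf) != NULL ? 1 : -1;
}
```

Whether the composed name actually exists is up to the operating system, so setlocale() can still fail here. A real implementation would need a fallback for that case, and for question (c) it would also have to decide whether to export the result to the environment.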