Hi Thiago,

Thanks for the comprehensive mail.

> On 31 Oct 2019, at 22:11, Thiago Macieira <thiago.macie...@intel.com> wrote:
>
> Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> QTextCodec support out of QtCore)
> See also: https://www.python.org/dev/peps/pep-0538/
> https://www.python.org/dev/peps/pep-0540/
>
> Summary:
> The change above, while removing QTextCodec from our API, had the
> side-effect of forcing the locale encoding on Unix to be only UTF-8. This
> RFC (to be recorded as a QUIP) is meant to discuss how we'll deal with
> locales on Unix systems on Qt 6. This does not apply to Windows because on
> Windows we cannot reasonably be expected to use UTF-8 for the 8-bit
> encoding.

I do not think we have to worry about the local 8-bit encoding on Windows
anymore these days. All our interaction with the OS goes through the 16-bit
APIs (i.e. uses UTF-16). I don’t think file content is a huge issue either
anymore, as Windows 10 seems to have added UTF-8 support to most of its
tools. AFAIK, we can also use a Unicode API for console and debug output, so
the only piece that’s left might be our users interacting with legacy ANSI
APIs. That should be a rare case and it should be straightforward to port
that over to use the Unicode API instead.

> There are three questions to be decided:
> a) What should Qt 6 assume the locale to be, if no locale is set?
> b) In case a non-UTF-8 locale is set, what should we do?
> c) Should we propagate our decision to child processes?
>
> My personal preference is:
> a) C.UTF-8
> b) override it to force UTF-8 on the same locale
> c) yes

I agree with all three choices. For your bonus (d) below, I’d say we should
print a warning if we encounter a non-UTF-8 locale other than C.

Cheers,
Lars

> Long explanation:
>
> On Unix systems, traditionally, the locale is a factor of multiple
> environment variables starting with LC_ (matching macro names from
> <locale.h>), as well as the LANG and LANGUAGE variables.
> If none of those is set, the C and POSIX standards say that the default
> locale is "C". Moreover, POSIX says that the "POSIX" locale is "C" and does
> not have multibyte encodings -- that excludes its encoding from being UTF-8.
>
> Most modern Unix-based operating systems do set a reasonable, UTF8-based
> locale for the user. They've been doing that for about 15 years -- it was in
> 2003 that this started, when I had to switch from zsh back to bash because
> zsh didn't support UTF-8 yet, but switched back in 2005 when it gained
> support. On top of that, some even more recent Unix offerings -- namely,
> macOS and Android -- enforce that the default (or only!) locale encoding is
> UTF-8.
>
> Right now, Qt faithfully accepts the locale configuration set by the user in
> the environment. It can do that because it has QTextCodec, which is also
> backed by either the libiconv routines or by ICU, so it can deal with any
> encoding. In properly-configured environments, there's no problem.
>
> The two Python documents above (PEP-538 and 540) also discuss how Python
> changed its strategy. I'm proposing that we follow Python and go a little
> further.
>
> What's the problem?
>
> The problem is where the locale is not set up properly or it is explicitly
> overridden. See PEP-538 for examples in containers, but as can be seen from
> it, Linux will default to "POSIX" or empty, which means Qt will interpret
> the locale as US-ASCII, which is almost never what is intended. Moreover,
> because of our use of QString for file names, any name that contains code
> units above 0x7f will be deemed a filesystem corruption and ignored on
> directory listing -- they are not representable.
>
> Furthermore, it happens quite often that users and tools set LC_ALL to "C"
> in order to obtain messages in English, so they can be parsed by other tools
> or to be pasted in emails (every time you see me post an error message from
> a console, I've done that).
> There are alternative locales that can be used, like "C.UTF-8", "C.utf8" or
> "UTF-8", but those depend on the operating system and may not be actually
> available.
>
> Arguing that this is an incorrect setup, while factually correct, does not
> change the fact that it happens.
>
> Questions and options:
>
> a) What should Qt 6 assume when the locale is unset or is just "C"?
>
> This is the case of a simple environment where the variables are unset or
> have some legacy system-wide defaults, as well as when the user explicitly
> sets LC_ALL to "C". The options are:
> - accept them as-is
> - assume that C with UTF-8 support was intended
>
> The first option is what we have today. And if this is our option, then
> neither question b nor c makes sense.
>
> The second option implies doing the check in QCoreApplication right after
> setlocale(LC_ALL, ""):
>     if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
>         setlocale(LC_CTYPE, "C.UTF-8");
>
> b) What should Qt 6 do if a different locale, other than C, is non-UTF8?
>
> This case is not an accident, most of the time. It can happen from time to
> time that someone is simply testing different languages and forces LC_ALL to
> something non-default to see what happens. They'll very quickly try the
> UTF-8 versions. But when it's not an accident, it means it was intended.
> This is the general state of Unix prior to 2003, when locales like "en_US",
> "en_GB", "fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants
> "fr_FR@euro", "de_DE@euro", "nl_NL@euro", etc. Options are:
>
> - accept them as-is (this is what Python does)
> - assume that the UTF-8 variant was intended, just not properly set
>
> The first option is what we have today, aside from the C locale (question
> (a)). However, keeping that option working implies keeping either ICU or
> iconv working in Qt 6 and we might want to get rid of that dependency for
> codecs.
>
> The second option implies modifying the QCoreApplication change above.
> Instead of explicitly checking for the C locale, we'd use
> nl_langinfo(CODESET) to find out what codec the locale is expecting. If it's
> not UTF-8, then we'd compose a new LC_CTYPE locale based on what the locale
> was and UTF-8. That means we'd transform:
>
>     ""              → "C.UTF-8"
>     "C"             → "C.UTF-8"
>     "en_US"         → "en_US.UTF-8"
>     "fr_FR@euro"    → "fr_FR.UTF-8@euro"
>     "zh_CN.GB18030" → "zh_CN.UTF-8"
>
> c) Should we propagate our decision to child processes?
>
> It's not possible to propagate choices to any other processes, so the
> question is only to child ones. Asked differently: should we set our choice
> in the application environment, so it's inherited by child processes?
>
> Child applications written with Qt 6 would not be affected, aside from maybe
> a negligible load time improvement. But any other applications, including
> Qt 5 ones, would not make the same choices. If we do not propagate, we could
> end up with child processes (often helpers) that make different choices as
> to what command-line arguments or pipes or contents in files mean.
>
> Note that we can't affect the *parent* process, so this problem could happen
> there.
>
> Welcome side-effect: other libraries and user's own code in the same process
> can call setlocale() after QCoreApplication has. It's possible that they,
> unknowingly, override our choices and change the C library back to an
> incorrect state. If we do set the environment, this cannot happen.
>
> Another side-effect is that in a Qt-based graphical environment, the "right"
> choice will be propagated anyway, to all child processes.
>
> Options are:
> - yes (this is what Python does)
> - no
>
> Bonus d) should we print a warning when we've made a change?
>
> Options are:
> - yes, for all of them
> - yes, but only for locales other than "C"
> - no
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
> Software Architect - Intel System Software Products
>
> _______________________________________________
> Development mailing list
> Development@qt-project.org
> https://lists.qt-project.org/listinfo/development
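
For reference, the name rewrite listed under option (b) in the quoted proposal ("fr_FR@euro" becoming "fr_FR.UTF-8@euro", and so on) could be sketched in C roughly as follows. This is only an illustration under assumptions, not the code from the change under review: utf8_locale_name is a hypothetical helper and not Qt API, treating "POSIX" the same as "C" is my assumption, and the input is assumed to be a single-category name such as the one returned by setlocale(LC_CTYPE, NULL).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not Qt API): writes into `out` the UTF-8 variant of
 * the locale name `name`, keeping any "@modifier" suffix.
 * Returns 0 on success, -1 if `out` is too small. */
static int utf8_locale_name(const char *name, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    /* Unset, empty or "C" all become "C.UTF-8", as in the proposal.
     * Handling "POSIX" the same way is an assumption of this sketch. */
    if (name == NULL || name[0] == '\0' || strcmp(name, "POSIX") == 0)
        name = "C";

    /* A locale name has the shape language[_TERRITORY][.codeset][@modifier].
     * Keep everything before the codeset or modifier, then the modifier. */
    size_t baselen = strcspn(name, ".@");
    const char *modifier = strchr(name, '@'); /* NULL when there is none */

    int n = snprintf(out, outlen, "%.*s.UTF-8%s",
                     (int)baselen, name, modifier ? modifier : "");
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting name would then be handed to setlocale(LC_CTYPE, ...). Since, as the proposal notes, such locales may not actually be available on a given system, the return value of that setlocale() call would still need checking.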