Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi

Sorry, it looks like this thread is not progressing in a calm and reasoned
manner, the way it was meant to be. And I'm very much to blame. So I apologise
for the strong language and passionate opinions.

I'm deleting most of what I had written as a reply so we can start over.

Let's start with your questions:

On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
> You have not yet answered
>
> - why this decision was made

You know, I don't know. To be frank, I don't know that a decision *was* made.

It all started with a change (see OP) about removing QTextCodec from the API
and from QtCore. It seemed reasonable enough but it turned up quite a few kinks
that hadn't been predicted. One of them, which may still be a showstopper, is
QXmlStreamReader's inability to handle XML data encoded in anything except
UTF-8, though a thorough search of all XML files in my system turned up exactly
zero such files.

I don't know why QTextCodec is being removed. I don't remember any decisions in
prior QtCS or this mailing list about removing it. We definitely discussed
removing the CJK codecs and their big tables and that can still be done, with
no effect on the API, since QTextCodec is backed by ICU's ucnv.

We may have discussed removing it, but I don't remember a firm decision. And
even if it is firm, after looking at the consequences of doing so, we may want
to reverse our decision.

Related to that is the discussion of whether UTF-8 is the only acceptable
locale on Unix systems. If we don't have QTextCodec, then we have to have
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8.
But even if we do have QTextCodec, that's still a reasonable question: should
we assume it is UTF-8? And should we enforce it?

Those were the questions in my OP.

> - who did it

Considering I don't know a decision *was* made, I don't think we can say who
made it.
> - what the actual problem to solve was

Three things being tackled, all related:

1) QTextCodec in the API

I think we cannot do without it, it'll have to stay in one way or another. So
the question reduces to whether it should stay in QtCore or be moved to another
library. Given the QXmlStreamReader problem above, it's probably best to keep
it in QtCore, actually.

QTextCodec has some API limitations but they can be fixed. It's not necessary
for us to remove it: it's not *that* broken.

2) QtCore size

As I said above, removing the legacy codecs we have code for is not a problem.
They are already disabled in Qt builds where ICU is present, so we'd
additionally remove them from all other builds.

Where ICU is present, there's no loss of functionality for user applications,
since ICU provides far more codecs than we do. For those without ICU, it stands
to reason that the user chose size so they are aware of the limitations. Plus,
one can always instantiate their own QTextCodec and add to the list (at least,
with today's implementation).

If QTextCodec is not in QtCore, then most likely you can't affect how QtCore
and almost all other Qt classes decode 8-bit data into QString, including
QTextStream.

and 3) misconfigured locale systems and filename handling

This is probably the biggest problem. As it is right now, when the locale isn't
set on a Unix system or if it is explicitly set to C, we *cannot* decode any
file names with the 8th bit set. Those file names are considered filesystem
corruption. And yet they are quite commonly created by the user outside of
English-speaking jurisdictions.

Your example of setting LC_ALL (or another environment variable) to force the
locale to print something that either can be parsed or shared is one such
problematic scenario. On one hand, you may need it to get some older tools to
parse output; on the other, it makes Qt applications unable to even see some
files exist.

> - why LC_*ALL* comes into play

Because it's the override.
If we decide to override and LC_ALL is set, then we have no choice but to
override it. If it is unset, then we can leave it unset too, but may need to
override LC_CTYPE.

> I get the impression that this thread was not started as an RFC for an
> open-ended discussion, but as a staged attempt to provide a figleaf for
> a pre-determined decision.

That was not the intention. That's why I am re-starting it so we can come back
to a reasoned approach.

Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change how we handle legacy (non-UTF-8) locales?

For #2, the sub-questions of the OP apply:
a) What should Qt 6 assume the locale to be, if no locale is set?
b) In case a non-UTF-8 locale is set, what should we do?
c) Should we propagate our decision to child processes?

My preferences were:
a) C.UTF-8
b) override it to force UTF-8 on the same locale
c) yes

The reason for my preference in propagating to child processes is so that we
have a consistent protocol between
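[Editor's illustration] Preferences (a) and (b) above can be read as a mapping from a locale name to its UTF-8 counterpart. The following is only a minimal sketch under that reading, not anything from Qt: the helper name `utf8_locale_name` is hypothetical, and it assumes POSIX-style names of the form language[_territory][.codeset][@modifier].

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Qt API): write the UTF-8 counterpart of a
 * POSIX-style locale name, language[_territory][.codeset][@modifier],
 * into `out`.
 */
static void utf8_locale_name(const char *name, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    /* Preference (a): no locale set, or plain C/POSIX, becomes C.UTF-8. */
    if (name == NULL || name[0] == '\0' ||
        strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0) {
        snprintf(out, outlen, "%s", "C.UTF-8");
        return;
    }

    /* Preference (b): keep language/territory and modifier, swap the codeset. */
    size_t base_len = strcspn(name, ".@");     /* length of e.g. "de_DE" */
    const char *modifier = strchr(name, '@');  /* e.g. "@euro", or NULL  */

    snprintf(out, outlen, "%.*s.UTF-8%s", (int)base_len, name,
             modifier ? modifier : "");
}
```

Under these assumptions, "de_DE.ISO-8859-15@euro" maps to "de_DE.UTF-8@euro" and an unset locale maps to "C.UTF-8". Whether the resulting locale is actually installed on a given system is a separate question that this sketch does not address.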
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Fri, Nov 15, 2019 at 05:47:04PM -0800, Thiago Macieira wrote:
> On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > > The questions are:
> > > 1) do we want to prevent another library from accidentally unsetting it?
> > > 2) do we want child processes to use the same?
> > >
> > > Note the answers for both questions must be the same, for the solution is
> > > the same. So either both yeses or both nos.
> >
> > This "answers for both questions must be the same" requirement is arbitrary.
> >
> > The fact that one known solution results in same answers to both is in
> > no way proof that no other solutions exist.
>
> I don't see how to prevent another library doing setlocale(LC_ALL, "") from
> overriding Qt's default other than to make setlocale(LC_ALL, "") do what
> we want. Since what it does is read the environment, the only solution is to
> change the environment.

You haven't even explained why this prevention would be needed or what exactly
would go wrong if you don't do that, and you cannot prevent the other library
from setting an explicit locale anyway. By modifying the environment, you just
catch the "" case, one out of many, and I'll continue to argue that it's not
Qt's business to try even that.

> > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > > can either deal with binary data or with UTF-8 text, there's no middle
> > > ground.
> > Now that's an interesting twist.
> >
> > The latest memo I did (not...) get was that codecs are to be moved into a
> > separate module. Which is actually ok, as it allows user code using codecs
> > to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> > + win".
>
> Sure. But that's no different than using ICU or writing your own code to
> convert from binary to text. QString will not support it on its own.
>
> > "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> > definitely news to me.
> > I've not seen this being discussed, neither here nor
> > within the part of the company that I usually talk to.
>
> You just said it yourself, above.

I did not say that.

> If QTextCodec moves to another library, we have no codecs in QtCore.

Not having codecs in QtCore does not mean QtCore cannot use codecs. One could
have a setup where QtCore just has the bare minimum, with stubs for other
codecs that are used when that QtCodecs lib is linked. Actually, I had expected
something like that to be the targeted solution once I heard that text codecs
move out of QtCore.

> > So when and where was this decision made, by whom, and why?
> >
> > Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> > codecs in some cases and did that person come to the conclusion that any
> > such use is bad and deserves to die?
>
> Probably not. Why does Qt Creator need other codecs?

My guess would be to handle code bases that are not (a subset of) UTF-8.

> > > you're arguing that there are broken applications that won't handle
> > > C.UTF-8 correctly, without giving a single example.
> >
> > ... is of course not true:
> >
> > 1. I did not claim there were "broken" applications that won't handle
> >    C.UTF-8 "correctly", I claimed that there are applications that react
> >    differently to C.UTF-8.
>
> Different behaviour is *exactly* what we want. We want this:

Who is 'we'?

> $ LC_ALL=C.UTF-8 ls á
> ls: cannot access 'á': No such file or directory
>
> not this:
>
> $ LC_ALL=C ls á
> ls: cannot access ''$'\303\241': No such file or directory

If you do not touch the environment, the user gets what he asked for. He will
most likely not want to see ''$'\303\241', but if he explicitly asks for it in
the environment he sets up, it's not Qt's job to override this.

> I thought the argument would be that despite being what we wanted,

Who is 'we'?

> it would break certain scenarios. But I haven't seen any examples of breakage.
>
> > gcc produces different output under C and C.UTF-8:
> >
> > echo x | LC_CTYPE=C gcc -xc -
> > <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> > at end of input
> >
> > echo x | LC_CTYPE=C.UTF-8 gcc -xc -
> > <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> > at end of input
> >
> > As an additional twist, this different behaviour does not require fancy
> > input, input is plain ASCII in both cases.
> >
> > Output parsers expecting "'" e.g. to produce a set of recommendations how
> > to quick-fix such problems in an IDE will break.
>
> Any application that is parsing GCC output is already setting LC_ALL in the
> child process's environment.

Not necessarily, and if so, it's rather 'C', not 'C.UTF-8'.

> Otherwise, they'd be getting possibly translated
> messages and we all know that the order of the messages could be different.
> Not to mention that instead of "" or even “” we could see «» or „“.

Also the point here is not the particular case. Each
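[Editor's illustration] Both sides above refer to a parent process choosing LC_ALL for a child whose output it parses. The following is a minimal POSIX sketch of that pattern, not code from Qt or Qt Creator: the helper name `run_with_lc_all` is hypothetical, and it assumes /bin/sh exists and that only the first line of output is of interest.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: run `command` via /bin/sh with
 * LC_ALL set to `locale` in the child only, and store the first line of the
 * child's standard output in `out`. Returns 0 on success, -1 on failure.
 */
static int run_with_lc_all(const char *locale, const char *command,
                           char *out, size_t outlen)
{
    int fds[2];
    int opened = 0;

    assert(locale != NULL && command != NULL);
    if (out == NULL || outlen == 0)
        return -1;
    out[0] = '\0';
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the override happens here, so the parent keeps its locale. */
        setenv("LC_ALL", locale, 1);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: read the first line, then drain the rest of the output. */
    close(fds[1]);
    FILE *in = fdopen(fds[0], "r");
    if (in != NULL) {
        opened = 1;
        if (fgets(out, (int)outlen, in) != NULL)
            out[strcspn(out, "\n")] = '\0';
        while (fgetc(in) != EOF)
            ;
        fclose(in);
    } else {
        close(fds[0]);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (opened && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A caller that parses compiler diagnostics could pass "C" here to get ASCII quotes regardless of the parent's locale. The sketch says nothing about which value a given application should choose, which is exactly the point under dispute in this thread.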
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Friday, 15 November 2019 00:52:55 PST Eike Ziller wrote:
> - You state that as if that were a fact imposed on us from some external
>   entity, and as if that patch were already in.

No, but that's the direction that started this conversation. If we're not going
to do that, then the entire discussion is moot.

> - I thought QTextCodec will
>   still be available, even if from a separate module. If that plan has
>   changed, provide a patch for Qt Creator as well.

It will, but we'll probably need a session next week to discuss in what form.
If we remove the codecs we kept and only use ICU, then QTextCodec will have
negligible cost and could stay in QtCore.

If it stays in QtCore, we still have a question whether QString::fromLocal8Bit
shall assume it's UTF-8 on Unix systems.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > The questions are:
> > 1) do we want to prevent another library from accidentally unsetting it?
> > 2) do we want child processes to use the same?
> >
> > Note the answers for both questions must be the same, for the solution is
> > the same. So either both yeses or both nos.
>
> This "answers for both questions must be the same" requirement is arbitrary.
>
> The fact that one known solution results in same answers to both is in
> no way proof that no other solutions exist.

I don't see how to prevent another library doing setlocale(LC_ALL, "") from
overriding Qt's default other than to make setlocale(LC_ALL, "") do what we
want. Since what it does is read the environment, the only solution is to
change the environment.

> > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > can either deal with binary data or with UTF-8 text, there's no middle
> > ground.
> Now that's an interesting twist.
>
> The latest memo I did (not...) get was that codecs are to be moved into a
> separate module. Which is actually ok, as it allows user code using codecs
> to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> + win".

Sure. But that's no different than using ICU or writing your own code to
convert from binary to text. QString will not support it on its own.

> "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> definitely news to me. I've not seen this being discussed, neither here nor
> within the part of the company that I usually talk to.

You just said it yourself, above. If QTextCodec moves to another library, we
have no codecs in QtCore. That means the rest of Qt will not support other
codecs.

> So when and where was this decision made, by whom, and why?
>
> Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> codecs in some cases and did that person come to the conclusion that any
> such use is bad and deserves to die?
Probably not. Why does Qt Creator need other codecs?

> > you're arguing that there are broken applications that won't handle
> > C.UTF-8 correctly, without giving a single example.
>
> ... is of course not true:
>
> 1. I did not claim there were "broken" applications that won't handle
>    C.UTF-8 "correctly", I claimed that there are applications that react
>    differently to C.UTF-8.

Different behaviour is *exactly* what we want. We want this:

$ LC_ALL=C.UTF-8 ls á
ls: cannot access 'á': No such file or directory

not this:

$ LC_ALL=C ls á
ls: cannot access ''$'\303\241': No such file or directory

I thought the argument would be that despite being what we wanted, it would
break certain scenarios. But I haven't seen any examples of breakage.

> gcc produces different output under C and C.UTF-8:
>
> echo x | LC_CTYPE=C gcc -xc -
> <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> at end of input
>
> echo x | LC_CTYPE=C.UTF-8 gcc -xc -
> <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> at end of input
>
> As an additional twist, this different behaviour does not require fancy
> input, input is plain ASCII in both cases.
>
> Output parsers expecting "'" e.g. to produce a set of recommendations how
> to quick-fix such problems in an IDE will break.

Any application that is parsing GCC output is already setting LC_ALL in the
child process's environment. Otherwise, they'd be getting possibly translated
messages and we all know that the order of the messages could be different. Not
to mention that instead of "" or even “” we could see «» or „“.

Changing the environment of a child process is not going to go away.

If you're telling me that you're setting the environment before the Qt
application to cope with its brokenness, I will ask why that application hasn't
been fixed in the 16 years since UTF-8 environments became a thing.
And we can provide a way to force Qt not to set the environment, for those
weird cases where you must deal with broken, proprietary cr#p that won't be
fixed until the heat death of the Universe.

And I will ask why everyone else must pay a performance price for the sake of
those old, broken applications that even the maintainer isn't fixing anymore.

> #include <locale.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main()
> {
>     if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
>         abort();
> }
>
> runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

Strawman example, this doesn't happen in reality. See my exhaustive search for
all such checks in an entire Linux distribution. I'm asking for *real*
situations.

> While contrived in this form, there _is_ code even in Creator checking
> for "C" literally, raising the suspicion that this might happen in other
> applications, too.

Oh, checking for "C" literally does exist, there were several in my search.
About half of
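[Editor's illustration] The quoted program relies on setlocale(category, "") taking the locale from the environment, which is also the premise of the argument that only an environment change survives a later setlocale(LC_ALL, "") call by another library. The following is a minimal sketch of that behaviour; the function name `locale_after_library_reset` is hypothetical and the two steps merely stand in for "the application" and "another library".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical demo: returns the locale name in effect after
 *  (1) the application selects `explicit_choice` programmatically, and
 *  (2) other code later calls setlocale(LC_ALL, ""), which re-reads the
 *      environment (here forced to `env_value` through LC_ALL).
 */
static const char *locale_after_library_reset(const char *env_value,
                                              const char *explicit_choice)
{
    assert(env_value != NULL && explicit_choice != NULL);

    /* What the user's environment says. */
    setenv("LC_ALL", env_value, 1);

    /* The application's own choice; it may fail if that locale is not
       installed, which this sketch deliberately ignores. */
    setlocale(LC_ALL, explicit_choice);

    /* "Another library" resets from the environment, discarding step (1). */
    return setlocale(LC_ALL, "");
}
```

With LC_ALL=C in the environment, the final call yields "C" whether or not the explicit "C.UTF-8" request succeeded, so a programmatic choice alone does not persist. The sketch takes no position on whether Qt should therefore modify the environment.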