Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

Giuseppe D'Angelo via Development Sun, 17 Nov 2019 15:13:02 -0800

On 17/11/19 01:55, Thiago Macieira wrote:

Hi


Sorry, it looks like this thread is not progressing in a calm and reasoned
manner, the way it was meant to be. And I'm very much to blame. So I apologise
for the strong language and passionate opinions. I'm deleting most of what I
had written as a reply so we can start over.

Let's start with your questions:

On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:

You have not yet answered

   - why this decision was made


You know, I don't know. To be frank, I don't know that a decision *was* made.
It all started with a change (see OP) about removing QTextCodec from the API
and from QtCore. It seemed reasonable enough but it turned up quite a few
kinks that hadn't been predicted. One of them, which may still be a
showstopper, is QXmlStreamReader's inability to handle XML data encoded in
anything except UTF-8, though a thorough search of all XML files in my system
turned up exactly zero such files.

I don't know why QTextCodec is being removed. I don't remember any decisions
in prior QtCS or this mailing list about removing it. We definitely discussed
removing the CJK codecs and their big tables and that can still be done, with
no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have
discussed removing it, but I don't remember a firm decision. And even if it is
firm, after looking at the consequences of doing so, we may want to reverse
our decision.

I don't know either. Is it to make QtCore smaller? Wasn't the feature system ("Qt Lite") supposed to address that? Or is it to make it less of a "kitchen sink", and split it into smaller libraries? Could that mean having QTextCodec in its own library, and QXmlStreamReader in another (that depends on the former)?

Related to that is the discussion of whether UTF-8 is the only acceptable
locale on Unix systems. If we don't have QTextCodec, then we have to have
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8.
But even if we do have QTextCodec, that's still a reasonable question: should
we assume it is UTF-8? And should we enforce it? Those were the questions in my
OP.

Should fromLocal8Bit be following the locale environment instead (LC_CTYPE, LC_MESSAGES or similar)?
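
For reference, on POSIX systems the character encoding belongs to the LC_CTYPE category, which is resolved with the precedence LC_ALL, then LC_CTYPE, then LANG (LC_MESSAGES governs the message language, not the encoding). A minimal sketch of that lookup follows. The helper name is hypothetical, and the `lookup` callback stands in for getenv() so the logic can be tried without touching the real environment:

```cpp
#include <functional>
#include <initializer_list>
#include <string>

// Hypothetical helper (not a Qt API): returns the locale name that governs
// character encoding, following the POSIX precedence
// LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
std::string effectiveCtypeLocale(
    const std::function<const char *(const char *)> &lookup)
{
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char *value = lookup(name);
        if (value && *value)
            return value;
    }
    return "C"; // nothing set: fall back to the "C"/POSIX locale
}
```

With LANG=ja_JP.UTF-8 and LC_ALL=C both set, this returns "C", which is the common override case discussed later in this thread.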

2) QtCore size
As I said above, removing the legacy codecs we have code for is not a problem.
They are already disabled in Qt builds where ICU is present, so we'd
additionally remove them from all other builds. Where ICU is present, there's
no loss of functionality for user applications, since ICU provides far more
codecs than we do. For those without ICU, it stands to reason that the user
chose size so they are aware of the limitations. Plus, one can always
instantiate their own QTextCodec and add to the list (at least, with today's
implementation).

If QTextCodec is not in QtCore, then most likely you can't affect how QtCore
and almost all other Qt classes decode 8-bit data into QString, including
QTextStream.

See above -- it also means QTextStream goes in some I/O lib that contains or depends on the codecs lib.

and 3) misconfigured locale systems and filename handling
This is probably the biggest problem. As it is right now, when the locale
isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode
any file names with the 8th bit set. Those file names are considered
filesystem corruption. And yet they are quite commonly created by the user
outside of English-speaking jurisdictions.
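
To illustrate why such names cannot be decoded: the C locale only guarantees 7-bit ASCII, so any byte with the 8th bit set has no defined meaning. The following is a sketch of that failure mode only, not Qt's actual decoding code, and `decodeAsAscii` is a hypothetical name:

```cpp
#include <string>

// Hypothetical helper: decode a file name as 7-bit ASCII.
// Returns false as soon as a byte with the 8th bit set is found,
// because such a byte cannot be interpreted without knowing the encoding.
bool decodeAsAscii(const std::string &bytes, std::u16string &out)
{
    out.clear();
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        if (c & 0x80)
            return false; // undecodable under a pure-ASCII assumption
        out.push_back(static_cast<char16_t>(c));
    }
    return true;
}
```

A name such as "track01.mp3" decodes, while the UTF-8 bytes of "café.mp3" (which contain 0xC3 0xA9) do not.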

Why do we bother about "saving the world"? A misconfigured system is the user's mistake. They should be in charge of fixing it in order to address the problem.

I get the impression that this thread was not started as an RFC for an
open-ended discussion, but as a staged attempt to provide a figleaf for
a pre-determined decision.


That was not the intention. That's why I am re-starting it so we can come back
to a reasoned approach.

Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change how we handle legacy (non-UTF8) locales?

For #2, the sub-questions of the OP apply:
  a) What should Qt 6 assume the locale to be, if no locale is set?
  b) In case a non-UTF-8 locale is set, what should we do?
  c) Should we propagate our decision to child processes?

My preferences were:
  a) C.UTF-8
  b) override it to force UTF-8 on the same locale
  c) yes
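
Preferences (a) and (b) can be sketched as a rewrite of the locale name, assuming the usual language[_territory][.codeset][@modifier] form. The function name is hypothetical and this is only an illustration of the idea, not proposed Qt code. It also does not check that the resulting locale is actually installed on the system:

```cpp
#include <string>

// Hypothetical helper: keep language, territory and modifier, but replace
// (or add) the codeset so that it is always UTF-8.
std::string withUtf8Codeset(const std::string &locale)
{
    // Preference (a): no locale, or the plain C/POSIX locale.
    if (locale.empty() || locale == "C" || locale == "POSIX")
        return "C.UTF-8";

    // Preference (b): same locale, UTF-8 codeset.
    std::string base = locale;
    std::string modifier;
    const std::string::size_type at = base.find('@');
    if (at != std::string::npos) {
        modifier = base.substr(at); // keep "@euro" and similar
        base.erase(at);
    }
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);            // drop the existing codeset
    return base + ".UTF-8" + modifier;
}
```

For example, "ja_JP" and "ja_JP.eucJP" both become "ja_JP.UTF-8", and "de_DE.ISO-8859-15@euro" becomes "de_DE.UTF-8@euro".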


How about

a) either C / C.UTF-8, but warning the user; but I'd up the ante, and say: just assert/crash.

b) keep the choice. Silently changing it sounds like a bad idea; we should never override the user choices silently.

c) no. We shouldn't "fix" subprocesses. They have the right to make their own independent decisions.

But I think we should. My arguments are that UTF-8 locales are the default in
all desktop Linux distributions, all BSDs and on macOS and have been for 15
years. Most embedded systems from the last 5 years at least also have it as
the default, especially those with graphical HMIs and most especially those
using Qt for that. Any applications that had problems with UTF-8 must have
been fixed for a long time and those that didn't are almost certainly launched
from wrappers that set a suitable environment for them, either via
QProcessEnvironment, execle, a shell script, or some other mechanism.

Or, on the other hand: what is the chance that a system comes without a locale set? Which is the more likely conclusion: that it's an accident, or a deliberate setting? If it's an accident, why not be *very* verbose about it?

Moreover, setting the locale to non-UTF-8 on a Qt 4 or 5 application on a
system with UTF-8-encoded file names is just *wrong* and asking for trouble,
for the filesystem reasons stated above. Just as an example, think of an
embedded system with a multimedia player that reads a FAT32-formatted USB
stick: it wouldn't go very far if it couldn't even see the music files with
non-ASCII characters in them. So I feel confident when I say applications
targeting a port to Qt 6 are not subject to that problem. Therefore, our
resetting of the environment inside the Qt 6 application is not going to
affect the child processes.

But if we disagree and think we shouldn't qputenv, I still think we should
assume by default the locale *is* UTF-8, even if the environment tells us it
isn't (an explicit LANG=ja_JP for example, but much more commonly an LC_ALL=C
override). The changing of the encoding is usually an undesired side-effect,
not an intentional choice. That is to say, LANG=ja_JP was actually meant to be
LANG=ja_JP.UTF-8 and LC_ALL=C could have been for the parsing reasons you
brought up. If we don't do the qputenv(), we'll still setlocale() in
QCoreApplication so qt_error_string() produces output and we'll live with the
danger that some code undoes our choice. My search through Linux library code
found no instance of a permanent setlocale() call with a non-null second
parameter (Qt is actually the only exception).


Qt is a "framework", not a "library". :-)


--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts

_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development