Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-21 Thread Bill Allombert
On Sat, Jan 21, 2023 at 12:58:19PM -0800, Russ Allbery wrote:
> Wouter Verhelst  writes:
> > On Fri, Jan 20, 2023 at 05:16:43PM +, Simon McVittie wrote:
> 
> >> Sure, but neither of those actually require us to support GBK or GB
> >> 18030 as a system locale, only as something that iconv() (or whatever
> >> browsers actually use, which is probably their own thing) can convert
> >> into their preferred internal representation (which is almost certainly
> >> UTF-8, UTF-16 or UCS-4).
> 
> > Those files need to be edited *somewhere*. If that somewhere is a Debian
> > desktop, then you also need editors that know how to write such files,
> > etc.
> 
> Both Emacs and vim will edit files in whatever (supported) encoding you
> want, regardless of the locale encoding.  I would assume this is not that
> uncommon of a feature for other editors as well.  This is therefore a bit
> like Simon's web browser example (although may be somewhat less
> transparent, admittedly).

This is true, but it misses an important point: it is usually not possible
to detect the character encoding of a plain text file.
That is where a default encoding is required.
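
To make the detection problem concrete, here is a minimal sketch (not from the thread; the sample string is an assumption chosen for the demonstration). The same bytes decode without error under two unrelated encodings, so nothing in the file itself says which one the author meant:

```python
# "中文" encoded as GB18030 (identical to its GBK encoding).
raw = "中文".encode("gb18030")

# The very same bytes are also a valid ISO-8859-1 string, so a program
# cannot tell from the data alone which reading was intended.
as_chinese = raw.decode("gb18030")
as_latin1 = raw.decode("iso-8859-1")

print(raw.hex(" "))
print(as_chinese)
print(as_latin1)
```

Both decodes succeed; only outside knowledge (a declared charset, or a default such as the locale encoding) picks the right one.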

Cheers,
-- 
Bill. 

Imagine a large red swirl here. 



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-21 Thread Russ Allbery
Wouter Verhelst  writes:
> On Fri, Jan 20, 2023 at 05:16:43PM +, Simon McVittie wrote:

>> Sure, but neither of those actually require us to support GBK or GB
>> 18030 as a system locale, only as something that iconv() (or whatever
>> browsers actually use, which is probably their own thing) can convert
>> into their preferred internal representation (which is almost certainly
>> UTF-8, UTF-16 or UCS-4).

> Those files need to be edited *somewhere*. If that somewhere is a Debian
> desktop, then you also need editors that know how to write such files,
> etc.

Both Emacs and vim will edit files in whatever (supported) encoding you
want, regardless of the locale encoding.  I would assume this is not that
uncommon of a feature for other editors as well.  This is therefore a bit
like Simon's web browser example (although may be somewhat less
transparent, admittedly).

(Also, if you're editing files written in Chinese, presumably you're using
an editor with good Chinese input support, and thus one that's more likely
to also have good Chinese encoding support.)

-- 
Russ Allbery (r...@debian.org)  



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-21 Thread Wouter Verhelst
On Fri, Jan 20, 2023 at 05:16:43PM +, Simon McVittie wrote:
> On Fri, 20 Jan 2023 at 09:54:21 -0700, Anthony Fok wrote:
> > supposedly some older Chinese websites are still using "GBK" as
> > encoding, probably something like:
> > 
> >  
> > 
> > which has less than 30,000 characters and thus a very limited subset
> > of Unicode.  And, presumably not everyone has the know how to convert
> > to UTF-8, the Chinese government wants those unable to at least change
> > that meta tag to:
> > 
> >  
> 
> Sure, but neither of those actually require us to support GBK or GB
> 18030 as a system locale, only as something that iconv() (or whatever
> browsers actually use, which is probably their own thing) can convert
> into their preferred internal representation (which is almost certainly
> UTF-8, UTF-16 or UCS-4).

Those files need to be edited *somewhere*. If that somewhere is a Debian
desktop, then you also need editors that know how to write such files,
etc.

Sometimes it's just easier if the whole thing uses the same encoding.

> Analogously, we've never supported using Windows-1252 (Microsoft's
> legacy Latin-1 variant) as a system locale encoding in some hypothetical
> locale like en_US.windows-1252, but HTML documents with
> text/html;charset=windows-1252 still work fine.

Windows-* encodings were native on Windows, and we only needed to
be able to read files that were generated on such systems.

We're talking here instead about a government-mandated encoding that
systems are expected to support; not only to consume data, but also to
*produce* data.

Windows-* encodings never had that attached to them.

-- 
 w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Anthony Fok
On Fri, Jan 20, 2023 at 10:15 AM Adam Borowski  wrote:
>
> On Fri, Jan 20, 2023 at 03:57:17PM +0200, Wouter Verhelst wrote:
> > On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> > If the PRC government *requires* a non-UTF-8 encoding for things sold to
> > them, we would be doing our users who want to sell a product containing
> > Debian to the PRC government a disservice by dropping support for it
> > altogether.
>
> It appears to me they require the character _set_ but not encoding: ie,
> that all glyphs can be displayed, they can be entered from keyboard, etc.
> The standard talks a lot about font support, etc.

Hi Adam,

You are correct indeed.  Yes, they worry more about the correct
coverage of characters, especially those that were added in 2022
corresponding to the latest ISO/IEC 10646 standard.

That said, they do require the ability to open, edit, and convert
GB18030-encoded files, but that is at the iconv() / ICU / application
level; like you said, they are NOT enforcing the use of
zh_CN.GB18030 as system locale.  (I now stand corrected.)

Incidentally, they have published three pre-recorded webinars thus
far, and I have reuploaded them to YouTube here for easier access for
the rest of the world:


https://www.youtube.com/watch?v=6gByuPXth7s=PLWCc17-QBkRjwhRfvCpxM8ez3b0qWO59a

I have yet to figure out how to add automatic Chinese subtitles and to
have it translated to English.  Maybe some day.  :-)

> > We don't have to ensure it works perfectly out of the box; just that
> > support is achievable with a reasonable amount of work.
>
> Our installer doesn't allow choosing such a locale, thus if indeed the
> encoding not character set is legally required, then we'd need to change
> so before the release.
>
> But I don't expect that to be the case -- a few years ago I played with
> Deepin and don't remember any weird encoding being used.  It would be good
> to either check again, or ask one of their maintainers.

Indeed, and that is what friends on the #debian-zh IRC channel are trying
to tell me, and I have personally confirmed that not only Deepin, but
also Red Flag, openKylin, and Ubuntu Kylin all use zh_CN.UTF-8 as the
default system locale (see one of my really long-winded responses
in this thread for details).  So you are indeed correct, and I stand
corrected too.  Sorry for the false alarm!  My mindset was still
living in 2002 when zh_CN.GB18030 was assumed to be a requirement by
the industry, but apparently all distros have switched over to
zh_CN.UTF-8 by default.

Debian does still support zh_CN.GB18030 with KDE, LXDE, LXQt,
Cinnamon, MATE, etc.; GNOME 43 and XFCE crash under it at the moment
(at least on my system), but that's good enough to pass the most
basic GB18030 test.  As you correctly observe, "zh_CN.GB18030
as system locale" is not legally required, so there is no need to change
the Debian installer or the locales package for that, and I wouldn't
worry about it for Debian 12.0.  (If the winds do change, we could
hypothetically sneak in the change in, say, Debian 12.1.  And the
myriad of Debian derivatives in China can easily make that change for
basic conformance too.)

Cheers,

Anthony



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Simon McVittie
On Fri, 20 Jan 2023 at 09:54:21 -0700, Anthony Fok wrote:
> supposedly some older Chinese websites are still using "GBK" as
> encoding, probably something like:
> 
>  
> 
> which has less than 30,000 characters and thus a very limited subset
> of Unicode.  And, presumably not everyone has the know how to convert
> to UTF-8, the Chinese government wants those unable to at least change
> that meta tag to:
> 
>  

Sure, but neither of those actually require us to support GBK or GB
18030 as a system locale, only as something that iconv() (or whatever
browsers actually use, which is probably their own thing) can convert
into their preferred internal representation (which is almost certainly
UTF-8, UTF-16 or UCS-4).

Analogously, we've never supported using Windows-1252 (Microsoft's
legacy Latin-1 variant) as a system locale encoding in some hypothetical
locale like en_US.windows-1252, but HTML documents with
text/html;charset=windows-1252 still work fine.
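
A minimal sketch of that analogy (the sample bytes are made up for illustration and are not from the thread): a document declared as Windows-1252 is transcoded at the application level, and the process never needs a Windows-1252 system locale.

```python
# Bytes as they might appear in a legacy Windows-1252 document.
raw = b"\x93caf\xe9\x94 costs 3\x80"

# Decode using the declared charset, then keep UTF-8 for internal use.
text = raw.decode("windows-1252")
utf8 = text.encode("utf-8")

# ISO-8859-1 maps 0x80-0x9F to C1 control characters instead, which is
# why Windows-1252 is only a *variant* of Latin-1.
latin1 = raw.decode("iso-8859-1")

print(text)
print(latin1 == text)
```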

> I have the feeling that many tech-savvy Chinese have already switched
> to UTF-8, but then perhaps in some circles there are lots of legacy
> GB2312/GBK documents or systems that made GB18030 a necessity, as an
> intermediate step to Unicode.

That doesn't seem so far away from how in some English-speaking circles
there are lots of legacy ISO-8859-1, ISO-8859-15 or (more likely)
Windows-1252 documents, and we can cope OK with those via transcoding,
even in UTF-8 system locales.

smcv



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Adam Borowski
On Fri, Jan 20, 2023 at 03:57:17PM +0200, Wouter Verhelst wrote:
> On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> > Preferring to use Unicode does seem to be the direction that all of
> > computing is going in, as a simplifying assumption - for example W3C
> > advice for HTML is "You should always use the UTF-8 character encoding"[1]
> > - and as we know, things that aren't tested usually don't work. So I
> > think the level of functionality for non-UTF-8 locales and encodings in
> > the software we package is going to decline over time, whether Debian
> > wants it to or not.
> 
> If the world's most populous country uses something that is not UTF-8, I
> think it's safe to say it's being tested, if only by people who will
> file bugs when things go awry.

And we do know there's not a single bug filed with a GB* locale within the
last 12 years.

There are far fewer reports from Chinese people than the population would
indicate: 0.75% of those with locale information, but that's still 3241
reports; I find it implausible that there's a non-negligible number of
users that go with GB* yet not a single one of them gave a single bit of
feedback.

> If the PRC government *requires* a non-UTF-8 encoding for things sold to
> them, we would be doing our users who want to sell a product containing
> Debian to the PRC government a disservice by dropping support for it
> altogether.

It appears to me they require the character _set_ but not encoding: ie,
that all glyphs can be displayed, they can be entered from keyboard, etc.
The standard talks a lot about font support, etc.

> We don't have to ensure it works perfectly out of the box; just that
> support is achievable with a reasonable amount of work.

Our installer doesn't allow choosing such a locale, thus if indeed the
encoding not character set is legally required, then we'd need to change
so before the release.

But I don't expect that to be the case -- a few years ago I played with
Deepin and don't remember any weird encoding being used.  It would be good
to either check again, or ask one of their maintainers.

But for now, I gotta run.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Let's make a Debian conference in Yalta, Ukraine.
⢿⡄⠘⠷⠚⠋⠀
⠈⠳⣄



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Anthony Fok
On Fri, Jan 20, 2023 at 7:27 AM Wouter Verhelst  wrote:
>
> On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> > Preferring to use Unicode does seem to be the direction that all of
> > computing is going in, as a simplifying assumption - for example W3C
> > advice for HTML is "You should always use the UTF-8 character encoding"[1]
> > - and as we know, things that aren't tested usually don't work. So I
> > think the level of functionality for non-UTF-8 locales and encodings in
> > the software we package is going to decline over time, whether Debian
> > wants it to or not.
>
> If the world's most populous country uses something that is not UTF-8, I
> think it's safe to say it's being tested, if only by people who will
> file bugs when things go awry.
>
> If the PRC government *requires* a non-UTF-8 encoding for things sold to
> them, we would be doing our users who want to sell a product containing
> Debian to the PRC government a disservice by dropping support for it
> altogether.
>
> We don't have to ensure it works perfectly out of the box; just that
> support is achievable with a reasonable amount of work.

Thank you Wouter!  That is exactly my thought, although after my
initial message, I have been told that "zh_CN.GB18030 as system
locale" may not be a strict requirement, and thus an explicit UI for
selecting zh_CN.GB18030 as system locale may not be necessary.  A
fellow Chinese DD suggested that some documentation on how to edit
/etc/locale.gen to enable zh_CN.GB18030 or other non-UTF-8 encodings
would likely be sufficient.

That said, if the testing authorities do want zh_CN.GB18030 to be
easily selectable, I think we can always sneak "zh_CN.GB18030" into
the locales configuration interface in a point release.

Cheers,
Anthony



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Anthony Fok
On Fri, Jan 20, 2023 at 7:42 AM Bill Allombert  wrote:
>
> On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> > On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> > > In their mind, GB 18030 encompasses a lot more than just
> > > a character encoding mapping table.  It is the full support package
> > > (including fonts, display, printing, input methods, etc.) for Han
> > > Chinese and all other minority languages used in China.
> >
> > Preferring to use Unicode does seem to be the direction that all of
> > computing is going in, as a simplifying assumption - for example W3C
> > advice for HTML is "You should always use the UTF-8 character encoding"[1]
> > - and as we know, things that aren't tested usually don't work. So I
> > think the level of functionality for non-UTF-8 locales and encodings in
> > the software we package is going to decline over time, whether Debian
> > wants it to or not.

Re-reading Simon's comment again: Yes, UTF-8 is the ideal, but
supposedly some older Chinese websites are still using "GBK" as
encoding, probably something like:

 

which has fewer than 30,000 characters and thus a very limited subset
of Unicode.  And, since presumably not everyone has the know-how to
convert to UTF-8, the Chinese government wants those unable to do so to
at least change that meta tag to:

 

where GB18030, being a Unicode Transformation Format, albeit a
somewhat awkward one, would be able to display any characters in
Unicode.

> It is true for everything. Users know how to pick the software that works for
> their environment. It is not relevant that software they do not use does not
> support their environment.
>
> Telling users to switch to UTF-8 because such and such software they never
> used and were never going to use does not support GB18030 does not make sense.

I have the feeling that many tech-savvy Chinese have already switched
to UTF-8, but then perhaps in some circles there are lots of legacy
GB2312/GBK documents or systems that made GB18030 a necessity, as an
intermediate step to Unicode.

(Not so in Taiwan and Hong Kong, they jump straight to UTF-8 from Big5
or Big5-HKSCS.  For better or for worse.)

> It is like saying the Linux console is deprecated because there are Debian
> packages that require X or Wayland.
>
> Cheers,
> --
> Bill. 
>
> Imagine a large red swirl here.

Thanks for the wonderful discussion, Bill!

Cheers,
Anthony



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Anthony Fok
On Thu, Jan 19, 2023 at 4:47 AM Simon McVittie  wrote:
>
> On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> > In their mind, GB 18030 encompasses a lot more than just
> > a character encoding mapping table.  It is the full support package
> > (including fonts, display, printing, input methods, etc.) for Han
> > Chinese and all other minority languages used in China.
>
> If I'm reading correctly, the character encoding part of GB 18030-2022 is
> a subset of a sufficiently new version of Unicode, in the same way that
> (say) ISO-8859-15 is a subset of Unicode: for every character representable
> in GB 18030-2022, you can point at an equivalent Unicode character and say
> "this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?

If using ISO-8859-15 "legacy encoding" as comparison, in China that
would be the 1980 "GB2312" (GB 2312-80) standard and the 1993 "GBK"
extension.  The character repertoires that these legacy
encodings/charsets contain are far fewer than what Unicode or ISO/IEC
10646 encompasses, and in that sense, they are "subsets of Unicode".

GB18030, on the other hand, is actually a full UTF or Unicode
Transformation Format (i.e. an encoding of *all* Unicode code points),
as in GB18030 maps to all codepoints of Unicode while maintaining
backward compatibility with existing GB2312 and GBK documents, just
like how UTF-8 maps to all codepoints of Unicode while maintaining
backward compatibility with ASCII.
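
The compatibility claim can be checked with a short sketch (the sample strings are assumptions for illustration; Python's built-in "gbk" and "gb18030" codecs stand in for the standards themselves):

```python
sample_ascii = "Debian"
sample_gbk = "中文"  # characters already present in GB2312/GBK

# UTF-8 is byte-for-byte identical to ASCII for ASCII-only text ...
utf8_matches_ascii = sample_ascii.encode("utf-8") == sample_ascii.encode("ascii")

# ... and GB18030 is byte-for-byte identical to GBK for GBK text.
gb18030_matches_gbk = sample_gbk.encode("gb18030") == sample_gbk.encode("gbk")

# A character outside GBK's repertoire (here an emoji) cannot be
# encoded as GBK, but GB18030 still has a sequence for it.
outside = "\U0001F600"
try:
    outside.encode("gbk")
    gbk_can_encode = True
except UnicodeEncodeError:
    gbk_can_encode = False
gb18030_bytes = outside.encode("gb18030")

print(utf8_matches_ascii, gb18030_matches_gbk, gbk_can_encode, len(gb18030_bytes))
```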

GB18030 encodes characters into 1-byte, 2-byte or 4-byte sequences.
The 1-byte sequences are essentially ASCII and the 2-byte sequences
essentially GBK; the 4-byte sequences give a total of 1,587,600
(126×10×126×10) codepoints, which easily cover Unicode's 1,112,064
(17×65536 − 2048 surrogates) assigned, reserved, and noncharacter code
points. (source: Wikipedia)
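
The arithmetic can be reproduced as a quick sketch (the three sample characters are assumptions chosen to hit each sequence length; Python's "gb18030" codec is used as a stand-in for the standard):

```python
# Four-byte sequences: bytes 1 and 3 each take 126 values (0x81-0xFE),
# bytes 2 and 4 each take 10 values (0x30-0x39).
four_byte_space = 126 * 10 * 126 * 10

# Unicode: 17 planes of 65536 code points, minus 2048 surrogates.
unicode_code_points = 17 * 65536 - 2048

# One sample character per sequence length: ASCII, a GBK character,
# and the first code point beyond the Basic Multilingual Plane.
samples = ("A", "中", "\U00010000")
lengths = [len(ch.encode("gb18030")) for ch in samples]

print(four_byte_space, unicode_code_points, lengths)
```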

Since GB18030 can be used to represent the entirety of all Unicode
code points, I would not call GB18030 a "subset" of Unicode.

And some people like to think of GB18030 as "UTF-GBK", e.g.
http://archives.miloush.net/michkap/archive/2013/03/28/10405914.html

> If that's the case, then supporting text files written in GB 18030
> does not *necessarily* require the internal representation or the
> system locale to be GB 18030, the same way I can still work with legacy
> en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
> be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
> UCS-4 on input, doing all text editing operations on that Unicode, and
> then transcoding back into GB 18030 on output. Most language frameworks
> already do this as a matter of API: Qt, Java and Windows tend to work
> with UTF-16 internally, while GLib/GTK uses UTF-8 internally.

Very true.  While GB18030 is another encoding form for Unicode (and
not a subset), indeed we don't need to use GB18030 as the "internal
representation or the system locale"; you have put it very nicely.
GB18030 is also somewhat inefficient as a UTF, as the required mapping
table and 4-byte conversion algorithm take up far more space and are
quite a bit slower than something as elegant as UTF-8.

> iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
> other non-Unicode encodings altogether. What this bug report is about is
> dropping support for locales whose associated encoding is non-Unicode,
> such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
> between a CLI program and the terminal emulator will be assumed to be UTF-8
> instead of ISO-8859-15 or GB18030.

Indeed, and thankfully, Google Chrome, Mozilla Firefox, LibreOffice
supposedly still support the reading (and writing) of GB18030
documents through iconv() or ICU or Qt's encoding conversions.

> The main thing I can see that would be a problem for GB 18030 users
> if the zh_CN.GB18030 locale was dropped is that various programs might
> assume that the locale encoding is the right one to assume when loading
> existing files and unable to guess the encoding, or the right one to
> write into new files by default - and so users who have moved from
> zh_CN.GB18030 to zh_CN.UTF-8 might find themselves unintentionally
> producing new UTF-8 files.

Yes.  These are some of the pains as we transition from legacy
GB2312/GBK encodings towards Unicode, and GB18030 (being a UTF) is
designed as a stepping stone.  But yes, moving to UTF-8 is indeed a
good thing, even in China, as China is not an isolated island.  Chinese
people do value interoperability with the world too.

> Preferring to use Unicode does seem to be the direction that all of
> computing is going in, as a simplifying assumption - for example W3C
> advice for HTML is "You should always use the UTF-8 character encoding"[1]
> - and as we know, things that aren't tested usually don't work. So I
> think the level of functionality for non-UTF-8 locales and encodings in
> the software we package is going to decline over time, whether Debian
> wants it to or not.

Very true, and it is already happening, 

Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Anthony Fok
On Wed, Jan 18, 2023 at 6:30 PM Russ Allbery  wrote:
>
> Anthony Fok  writes:
>
> > I'm not asking you to spend any time working on GB18030; that would be
> > the job of Debian Chinese i18n/L10n team as well as the wider community
> > (glibc, libiconv, Qt, etc.)  All I am asking you is to maintain the
> > status quo, and don't discount anything other than UTF-8 as legacy.
>
> This topic comes up a lot, and I'd love to put something in either Policy
> or the Developer's Reference proactively to at least explain what we know
> about what our users need and to point people at the right questions to
> ask if it's been another decade and they want to standardize on UTF-8
> again.  Do you have an idea of something suitable we should say?

Hey Russ, thank you so much for your message!

Adam, I would like to apologize; while I still value that Debian
maintains its existing support for zh_CN.GB18030 locale, I did speak a
bit too soon.  I'll elaborate.

> I do think we probably should say more *somewhere* about making UTF-8 the
> default choice in most situations if you otherwise have no reason to
> choose anything specific.

I totally agree.  Besides the Debian Policy, a fellow DD on the
#debian-zh IRC channel (linked with Telegram) suggests that UTF-8 being
the default should be mentioned in the Release Notes, probably with
pointers to fuller documentation and instructions on how to manually
add locales with legacy and other non-UTF-8 encodings: edit
/etc/locale.gen and /etc/default/locale, and run locale-gen.
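
As a rough sketch of the /etc/locale.gen part of that procedure: the helper name enable_locale is hypothetical, and the sketch assumes the file holds one "locale charset" entry per line, with disabled entries commented out by "#". It only transforms text; writing the result back and running locale-gen need root and are left out.

```python
def enable_locale(locale_gen_text, entry="zh_CN.GB18030 GB18030"):
    """Return locale.gen content with `entry` present and uncommented.

    Hypothetical helper for illustration only.
    """
    out = []
    found = False
    for line in locale_gen_text.splitlines():
        # Compare against the line with any leading "#" markers removed.
        if line.lstrip("#").strip() == entry:
            if not found:
                out.append(entry)  # uncomment (or keep) the first match
                found = True
            # later duplicates of the same entry are dropped
        else:
            out.append(line)
    if not found:
        out.append(entry)  # entry was absent: append it
    return "\n".join(out) + "\n"


sample = "# en_US.UTF-8 UTF-8\nzh_CN.UTF-8 UTF-8\n# zh_CN.GB18030 GB18030\n"
result = enable_locale(sample)
print(result, end="")
```

Other entries, commented or not, are left untouched, and applying the helper twice gives the same result as applying it once.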

> For example, as you point out, files written in
> Chinese for Chinese people may or may not want to use UTF-8, but at this
> point I do think anything written in, say, French or German probably
> should just use UTF-8.

Totally agreed.

And I should clarify: Actually, I would say, for the majority of end
users in Mainland China, zh_CN.UTF-8 would still be the best default,
though likely some government and financial institutions may require
the use of zh_CN.GB18030 probably for certain terminal applications.
I don't know the percentage though.

I asked around #debian-zh last night for more feedback, and most
existing users/developers definitely prefer UTF-8 and are using
zh_CN.UTF-8.  Some joked that those who choose zh_CN.GB18030 are the
ones who like to create difficulties for themselves.

And while support for zh_CN.GB18030 as a "system locale" was
apparently a requirement for conformance testing for GB 18030-2000
some twenty years ago (I went through that period personally, when
there was a mad dash by all Linux vendors to get that as well as fonts
and input methods working properly), fellow Chinese DDs agree that it
may have been a requirement 20 years ago but no longer is today.  They
point out that all China homegrown distributions nowadays use
LANG=zh_CN.UTF-8 by default, and apparently still pass the
GB 18030(-2005?) conformance tests.  They suggest that having the
ability to read and write GB18030-encoded documents, and being able to
convert between UTF-8 and GB18030 etc., should probably be sufficient.

I was initially unconvinced, but then I tested in a virtual machine
various ISO images from the latest releases of China homegrown Linux
distributions, e.g. Deepin Linux, openKylin, and even Red Flag Desktop
Linux, and they all use zh_CN.UTF-8 as the default system locale!
(Red Flag does have the zh_CN.gb18030 locale precompiled, but then
it seems to have all available locales precompiled according to
"locale -a".)

Incidentally, Red Flag Desktop Linux is now based on Debian too!  They
used to co-develop the RHEL-based Asianux on which they built their
distro.  What a pleasant surprise!

> Also, file names in the file system shipped in
> Debian packages probably should use UTF-8 since there's no way to declare
> the character set and there are some solid reasons for picking one and
> sticking with it.  (Obviously, users can create files with any character
> set they want.)

Great point!  Totally agreed.

> > Debian already supports GB 18030-2000 (or GB 18030-2005) rather well.
>
> How do I configure a locale that uses this as the default character set?
> I'd like to be able to test this configuration (at least for my own
> packages), but since recent changes to locales it doesn't appear to be an
> option in debconf and I was confused trying to figure out how I should
> make it work.

Good question!  I somehow missed that removal of "legacy" encodings
from the locales dpkg configure menu... so that's why Adam was saying
official support for legacy locales has indeed been dropped. (Thanks
Adam!  You're just speaking the facts.)

Anyhow, to test how Debian and various desktop environments run under
zh_CN.GB18030 as system locale, here are the steps:

1. Create the /usr/local/share/i18n/SUPPORTED file with the line

zh_CN.GB18030 GB18030

(I actually started by prepending that line before "zh_CN.UTF-8 UTF-8"
 in /var/lib/dpkg/info/locales.config, but then saw that it has
provision for
 user-provided list of 

Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Bill Allombert
On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> > In their mind, GB 18030 encompasses a lot more than just
> > a character encoding mapping table.  It is the full support package
> > (including fonts, display, printing, input methods, etc.) for Han
> > Chinese and all other minority languages used in China.
> 
> Preferring to use Unicode does seem to be the direction that all of
> computing is going in, as a simplifying assumption - for example W3C
> advice for HTML is "You should always use the UTF-8 character encoding"[1]
> - and as we know, things that aren't tested usually don't work. So I
> think the level of functionality for non-UTF-8 locales and encodings in
> the software we package is going to decline over time, whether Debian
> wants it to or not.

It is true for everything. Users know how to pick the software that works for
their environment. It is not relevant that software they do not use does not
support their environment.

Telling users to switch to UTF-8 because such and such software they never used
and were never going to use does not support GB18030 does not make sense.

It is like saying the Linux console is deprecated because there are Debian
packages that require X or Wayland.

Cheers,
-- 
Bill. 

Imagine a large red swirl here. 



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-20 Thread Wouter Verhelst
On Thu, Jan 19, 2023 at 11:47:42AM +, Simon McVittie wrote:
> Preferring to use Unicode does seem to be the direction that all of
> computing is going in, as a simplifying assumption - for example W3C
> advice for HTML is "You should always use the UTF-8 character encoding"[1]
> - and as we know, things that aren't tested usually don't work. So I
> think the level of functionality for non-UTF-8 locales and encodings in
> the software we package is going to decline over time, whether Debian
> wants it to or not.

If the world's most populous country uses something that is not UTF-8, I
think it's safe to say it's being tested, if only by people who will
file bugs when things go awry.

If the PRC government *requires* a non-UTF-8 encoding for things sold to
them, we would be doing our users who want to sell a product containing
Debian to the PRC government a disservice by dropping support for it
altogether.

We don't have to ensure it works perfectly out of the box; just that
support is achievable with a reasonable amount of work.

-- 
 w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-19 Thread Simon McVittie
On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> In their mind, GB 18030 encompasses a lot more than just
> a character encoding mapping table.  It is the full support package
> (including fonts, display, printing, input methods, etc.) for Han
> Chinese and all other minority languages used in China.

If I'm reading correctly, the character encoding part of GB 18030-2022 is
a subset of a sufficiently new version of Unicode, in the same way that
(say) ISO-8859-15 is a subset of Unicode: for every character representable
in GB 18030-2022, you can point at an equivalent Unicode character and say
"this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?

If that's the case, then supporting text files written in GB 18030
does not *necessarily* require the internal representation or the
system locale to be GB 18030, the same way I can still work with legacy
en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
UCS-4 on input, doing all text editing operations on that Unicode, and
then transcoding back into GB 18030 on output. Most language frameworks
already do this as a matter of API: Qt, Java and Windows tend to work
with UTF-16 internally, while GLib/GTK uses UTF-8 internally.
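
The transcode-on-input, transcode-on-output approach described above can be sketched as follows. The helper name edit_gb18030_file and the sample text are assumptions for illustration, and Python's built-in "gb18030" codec stands in for iconv():

```python
import os
import tempfile


def edit_gb18030_file(path, transform):
    """Decode a GB 18030 file to Unicode, edit it, and encode it back."""
    with open(path, "r", encoding="gb18030", newline="") as f:
        text = f.read()  # from here on, `text` is plain Unicode
    with open(path, "w", encoding="gb18030", newline="") as f:
        f.write(transform(text))  # transcode back on output


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        # U+4E00 followed by some ASCII, stored as GB 18030 bytes.
        f.write("\u4e00 one\n".encode("gb18030"))

    edit_gb18030_file(path, lambda s: s.replace("one", "1"))

    with open(path, "rb") as f:
        result = f.read()

print(result.decode("gb18030"), end="")
```

The locale encoding never enters the picture: every open() call names the file encoding explicitly.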

iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
other non-Unicode encodings altogether. What this bug report is about is
dropping support for locales whose associated encoding is non-Unicode,
such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
between a CLI program and the terminal emulator will be assumed to be UTF-8
instead of ISO-8859-15 or GB18030.

The main thing I can see that would be a problem for GB 18030 users
if the zh_CN.GB18030 locale was dropped is that various programs might
assume that the locale encoding is the right one to assume when loading
existing files and unable to guess the encoding, or the right one to
write into new files by default - and so users who have moved from
zh_CN.GB18030 to zh_CN.UTF-8 might find themselves unintentionally
producing new UTF-8 files.
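
A sketch of the failure mode (the read_text helper and its fallback policy are hypothetical, meant only to mimic programs that trust the locale when no encoding is given):

```python
import locale
import os
import tempfile


def read_text(path, encoding=None):
    """Read a text file; with no explicit encoding, fall back to the locale's."""
    if encoding is None:
        # Hypothetical policy: assume the locale encoding, as many programs do.
        encoding = locale.getpreferredencoding(False)
    with open(path, encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")
    with open(path, "wb") as f:
        f.write("中文\n".encode("gb18030"))

    # Naming the encoding always works, whatever the system locale is.
    explicit = read_text(path, encoding="gb18030")

    # Relying on the locale only works if it happens to match the file;
    # under a UTF-8 locale these bytes are not valid input.
    try:
        guessed = read_text(path)
    except UnicodeDecodeError:
        guessed = None

print(explicit == "中文\n", guessed == explicit)
```

The same logic applies to writing: a program that defaults to the locale encoding will start producing UTF-8 files once the user switches to zh_CN.UTF-8.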

Preferring to use Unicode does seem to be the direction that all of
computing is going in, as a simplifying assumption - for example W3C
advice for HTML is "You should always use the UTF-8 character encoding"[1]
- and as we know, things that aren't tested usually don't work. So I
think the level of functionality for non-UTF-8 locales and encodings in
the software we package is going to decline over time, whether Debian
wants it to or not.

smcv

[1] https://www.w3.org/International/questions/qa-html-encoding-declarations



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-18 Thread Russ Allbery
Anthony Fok  writes:

> I'm not asking you to spend any time working on GB18030; that would be
> the job of Debian Chinese i18n/L10n team as well as the wider community
> (glibc, libiconv, Qt, etc.)  All I am asking you is to maintain the
> status quo, and don't discount anything other than UTF-8 as legacy.

This topic comes up a lot, and I'd love to put something in either Policy
or the Developer's Reference proactively to at least explain what we know
about what our users need and to point people at the right questions to
ask if it's been another decade and they want to standardize on UTF-8
again.  Do you have an idea of something suitable we should say?

I do think we probably should say more *somewhere* about making UTF-8 the
default choice in most situations if you otherwise have no reason to
choose anything specific.  For example, as you point out, files written in
Chinese for Chinese people may or may not want to use UTF-8, but at this
point I do think anything written in, say, French or German probably
should just use UTF-8.  Also, file names in the file system shipped in
Debian packages probably should use UTF-8 since there's no way to declare
the character set and there are some solid reasons for picking one and
sticking with it.  (Obviously, users can create files with any character
set they want.)

> Debian already supports GB 18030-2000 (or GB 18030-2005) rather well.

How do I configure a locale that uses this as the default character set?
I'd like to be able to test this configuration (at least for my own
packages), but since recent changes to locales it doesn't appear to be an
option in debconf and I was confused trying to figure out how I should
make it work.

-- 
Russ Allbery (r...@debian.org)  



Bug#1026231: debian-policy: document droppage of support for legacy locales

2023-01-18 Thread Anthony Fok
Hello,

On Mon, Dec 19, 2022 at 2:48 PM Bill Allombert  wrote:
> Which raise the question: does the corresponding user group moved to UTF-8 ?
> Judging from ,
> neither Chinese nor Japanese users have overwhelmingly moved to UTF-8,
> so it would be problematic to stop supporting BIG5, GB18030 and EUC-JP.

Bill, thank you, thank you, thank you!  You speak the voice of reason!

Adam, we living in the West may think of BIG5, GB18030 and EUC-JP as
legacy/obsolete encodings, but in Mainland China, GB18030 is anything
but legacy.  It is a mandatory national standard that has recently
been brought up to date in GB 18030-2022, synchronizing with ISO/IEC
10646:2017 (equivalent to Unicode version 11.0).

"GB 18030 is a national standard with stringent conformance
requirements that regulate eligibility for products or services to be
sold in China."  I personally went through this trying to get the now
defunct ThizLinux distro GB 18030-2000 conformant 20 years ago.  GB
18030-2022 will become mandatory on 2023-08-01.  Why the urgency?  To
add support for 17000+ rarer CJK Han characters found in people's and
place names, as well as to improve support for ethnic minority languages
in China.  And the GB18030 standard committee is really serious about
promoting GB18030 because they are eager to resolve some real issues
of "missing characters" that are negatively affecting the people
living in China.  To my pleasant surprise, they are putting out a
public lecture webinar series that explains the why and the how of
implementing GB 18030-2022, with the 3rd video published on
2022-12-30.  In their mind, GB 18030 encompasses a lot more than just
a character encoding mapping table.  It is the full support package
(including fonts, display, printing, input methods, etc.) for Han
Chinese and all other minority languages used in China.

See e.g. the following excellent articles for more information:

 * https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
 * https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf

Even though Debian is not proprietary/commercial software, the GB
18030 authority highly recommends that free/libre and open-source
software _do_ implement GB 18030-2022.  That's especially true
considering the fact that vendors in China may be offering Debian as a
solution for clients, but they would be prevented from doing so if
Debian Policy spells out "We support UTF-8 and UTF-8 only."  Think of
all the ARM and RISC-V single-board computers made in China where
Debian is the default OS image; Debian or derivatives (Ubuntu, Ubuntu
Kylin, etc.) that are pre-installed on PCs sold in China, etc.

As a matter of fact, I have recently been approached to
update the IANA charset technical summary for "GB18030" (i.e. the
original GB 18030-2000) in
https://www.iana.org/assignments/charset-reg/GB18030 with the latest
updates for GB 18030-2022.  (Haha, I am starting to fret about it
because I am no expert in GB18030, but many thanks to e.g. Dr. Ken
Lunde, the expert in CJKV information processing, who has kindly
allowed me to borrow any of his articles in updating the IANA charset
documentation for GB18030.)

I'm not asking you to spend any time working on GB18030; that would be
the job of Debian Chinese i18n/L10n team as well as the wider
community (glibc, libiconv, Qt, etc.)  All I am asking you is to
maintain the status quo, and don't discount anything other than UTF-8
as legacy.  Debian already supports GB 18030-2000 (or GB 18030-2005)
rather well.  Don't let that existing support die!  If anything, we'd
need to improve GB18030 support to conform with GB 18030-2022, though
thankfully much of that work would likely come from upstream projects
or from Debian derivatives or other distros that are actually selling
their products in China.

Many thanks for your understanding!

Kind regards,

Anthony Fok



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-22 Thread Simon McVittie
On Wed, 21 Dec 2022 at 18:15:11 +0100, Adam Borowski wrote:
> On Mon, Dec 19, 2022 at 07:08:09PM +, Simon McVittie wrote:
> > On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> > > * The execution environment (usually init system or a container) must
> > >   default to UTF-8 encoding unless explicitly configured otherwise.
> > 
> > Is this already true? This seems like the sort of thing which should be
> > fixed in at least the major init systems and container managers before it
> > goes into Policy, in the interests of not making those init systems and
> > container managers retroactively buggy.
> 
> I'm less knowledgeable about containers, but they appear to work.  It might
> be due to copying variables from the host or having template defaults...

There are three major categories of container that I'm aware of:

1. Full-system containers like lxc/lxd run a complete init system like
   systemd or sysv-rc for the container (they aim to behave like a
   lighter-weight alternative to VMs), so their init system would be
   responsible for making this work. This seems non-problematic for your
   proposed requirement: if an init system does the right thing on bare
   metal or in a VM, it will also do the right thing in lxc containers.

2. Per-service containers like the upstream-recommended use of Docker either
   have no init system at all, or a minimal reaper process like tini
   (they aim to behave like a heavier-weight alternative to chroots).
   chroot managers that have subsequently gained namespace functionality,
   like some uses of pbuilder and schroot, also work like this.
   I think these are the category that is most likely to have trouble
   complying with the requirement you propose, because the container manager
   is intentionally "hands-off" (mechanism more than policy), while the
   processes run inside the container are not under Debian's control (they
   are chosen by whoever wrote the Dockerfile or equivalent, and might come
   from either Debian or another distribution).

3. Per-app containers like Flatpak and Snap are not intended to emulate a
   whole system, so they are expected to inherit locale settings from the
   host system like a non-sandboxed app would. They shouldn't be a problem
   here, as long as your proposed requirement is worded in such a way that
   it is valid for these container managers to rely on the host system
   locale being correct (in other words, if someone using the legacy en_GB
   locale reports a bug "flatpak: does not set a UTF-8 locale", I should be
   able to close it with "This is not flatpak's job, set your host system
   locale to en_GB.UTF-8 instead").

podman and systemd-nspawn can be used as either the first category
(like lxc) or the second (like Docker), depending on how they were invoked.

> Anyway, my aim is more to tell packages that they are allowed to misbehave
> when the settings are missing than to hunt misuse scenarios.  But, if such
> a scenario is found, with the current Policy there is no recourse, while
> if this rule is added it would be a bug.

Not every bug necessarily needs to be a Policy violation.

> I just tested Windows 11 notepad.exe: it defaults to UTF-8, and when
> saving it allows "ANSI" "UTF-16 LE" "UTF-16 BE" "UTF-8" (default) and
> "UTF-8 with BOM".

Yes, that's the sort of UX that I think needs to be allowed. I would
personally not expose that choice in the UI of an intentionally simple
text editor like Notepad or gnome-text-editor, but I would expect similar
behaviour in an editor with more elaborate programmer-oriented features,
like vim, emacs, gnome-builder or kate.

If iconv(1) or a similar program has an option for "UTF-8 with BOM" then
that also needs to not be accidentally declared to be a Policy violation.

smcv



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-21 Thread Bill Allombert
On Wed, Dec 21, 2022 at 03:23:09PM +0100, Adam Borowski wrote:
> On Mon, Dec 19, 2022 at 10:44:12PM +0100, Bill Allombert wrote:
> > Which raise the question: does the corresponding user group moved to UTF-8 ?
> > Judging from ,
> > neither Chinese nor Japanese users have overwhelmingly moved to UTF-8,
> > so it would be problematic to stop supporting BIG5, GB18030 and EUC-JP.
> 
> We actually do have data about locale usage in Debian.
> I've copied .report files from bugs-mirror, and
> grep -arm1 ^Locale: */*/*.report
> shows that:
> * most recent use of BIG5 is #925894 from March 2019
> * there's no use of any GB locale (other than en_GB :p) past #609517 (2011)
> * for EUC there's #1001207 (2021) #953616 #939588 #939494 #893625

I do not think bug submitters expect the Locale field to be used for locale
usage statistics, so it does not seem fair to use it for that purpose.

Cheers,
-- 
Bill. 

Imagine a large red swirl here. 



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-21 Thread Adam Borowski
On Mon, Dec 19, 2022 at 07:08:09PM +, Simon McVittie wrote:
> On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> > As of Bookworm, legacy locales are no longer officially supported.
> 
> For clarity, I think when you say "legacy locales" you mean locales
> whose character encoding is either explicitly or implicitly something
> other than UTF-8 ("legacy national encodings"), like en_US (implicitly
> ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
> (explicitly ISO-8859-15 in its name). True?

Aye.

> Many of the non-UTF-8 encodings are single-byte encodings in the
> ISO-8859 family, but if I understand correctly, your reasoning applies
> equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
> Also true?

Aye.  Anything but UTF-8.

> Meanwhile, locales with a UTF-8 character encoding, like en_AG
> (implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
> (explicitly UTF-8), are the ones you are considering to be non-legacy.
> Also true?

Right.

> I think for Policy use, this would have to say something more precise,
> like "locales with a non-UTF-8 character encoding". I wouldn't want to
> get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
> or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
> locale :-)

English (traditional) vs English (simplified) :p

> When you say "officially supported" here, do you refer to the extent
> to which they are supported by the glibc maintainers, or some other
> group? Or are you describing a change request that they *should not*
> be officially supported by Debian - something that is not necessarily
> true yet, but in this bug you are asking for it to become true?

My primary source is glibc, especially the debconf questions from "locales",
although bit-rot and/or outright droppage is widespread in other packages.

> > * Software may assume they always run in an UTF-8 locale, and emit or
> >   require UTF-8 input/output without checking.
> 
> I suspect this is already common: for example, ikiwiki is strictly
> UTF-8-only and ignores locales' character sets, which is arguably a bug
> right now but would become a non-bug with your proposed policy.

Exactly, I want to declare that a non-bug, thus saving developer time.

> This is a "may" so it can't possibly make a package gain bugs. It might
> make packages have fewer bugs.

Aye.

> > * The execution environment (usually init system or a container) must
> >   default to UTF-8 encoding unless explicitly configured otherwise.
> 
> Is this already true? This seems like the sort of thing which should be
> fixed in at least the major init systems and container managers before it
> goes into Policy, in the interests of not making those init systems and
> container managers retroactively buggy.

Systemd does so since version 240, sysvinit relies on settings in /etc/
thus in the case of bare debootstrap the variables might be unset -- which
is mostly moot since glibc 2.35.  We briefly discussed a one-line patch
to ensure there's a fallback default, it's currently not applied (but can
be).  This would be relevant only for corner cases like an unconfigured
system running non-glibc non-musl binaries that rely on LC_*.
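
For readers unsure which variable "wins" when an init system or container leaves some of them unset, here is a small Python sketch of the POSIX lookup order for the character-type category (LC_ALL, then LC_CTYPE, then LANG). The function name effective_ctype is hypothetical, and returning None for "nothing set" is a choice made for this sketch; what a given libc does in that case is outside its scope.

```python
from typing import Mapping, Optional


def effective_ctype(env: Mapping[str, str]) -> Optional[str]:
    """Return the locale name that governs LC_CTYPE, or None if unset."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # an empty value is treated the same as an unset one
            return value
    return None
```

Calling effective_ctype(os.environ) inside a freshly debootstrapped chroot would be one way to see whether any of the three variables reached the process at all.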

I'm less knowledgeable about containers, but they appear to work.  It might
be due to copying variables from the host or having template defaults...

Anyway, my aim is more to tell packages that they are allowed to misbehave
when the settings are missing than to hunt misuse scenarios.  But, if such
a scenario is found, with the current Policy there is no recourse, while
if this rule is added it would be a bug.

> > * Legacy locales are no longer officially supported, and packages may
> >   drop support for them and/or exclude them from their testsuites.
> > * Packages may retain support for legacy locales, but related bug reports
> >   (unless security related) are considered to be of wishlist severity.
> 
> Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
> correctly its character encoding is officially 7-bit ASCII), or has it
> been redefined to be UTF-8? Given the special status of the C locale in
> defaults and standards, it might be necessary to say that it's the only
> supported locale with a non-UTF-8 character encoding.

Hmm... if I recall correctly, old POSIX left the behaviour of characters
above 126 undefined, making C.UTF-8 _almost_ match the requirements (with
the only exception being iswblank() IIRC), but the current version
specifies ASCII (rather than the C standard's "portable subset") with no
additions to character classes other than cntrl and punct allowed.

This is the locale all processes start with, until they call setlocale().
I'm still not decided whether it should be allowed as the system locale
(ie, when a process says it wants locale handling enabled).

Having it breaks non-ASCII in GUIs, some text output, causes misalignment,
etc.  Thus maybe we can relegate it to the 

Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-21 Thread Adam Borowski
On Mon, Dec 19, 2022 at 10:44:12PM +0100, Bill Allombert wrote:
> Which raise the question: does the corresponding user group moved to UTF-8 ?
> Judging from ,
> neither Chinese nor Japanese users have overwhelmingly moved to UTF-8,
> so it would be problematic to stop supporting BIG5, GB18030 and EUC-JP.

We actually do have data about locale usage in Debian.
I've copied .report files from bugs-mirror, and
grep -arm1 ^Locale: */*/*.report
shows that:
* most recent use of BIG5 is #925894 from March 2019
* there's no use of any GB locale (other than en_GB :p) past #609517 (2011)
* for EUC there's #1001207 (2021) #953616 #939588 #939494 #893625

Thus:
* Chinese encodings are totally dead for being used as a system locale
* Japanese are nearly so
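
For anyone wanting to go beyond "most recent occurrence" and count charsets, the grep above could be followed by a tally along these lines. This is only a sketch: the layout of the Locale: lines in the sample (LANG=..., optionally followed by "(charmap=...)") is an assumption, and the names charset_of and tally are hypothetical.

```python
import re
from collections import Counter

CHARMAP_RE = re.compile(r"\(charmap=([^)]+)\)")
LANG_RE = re.compile(r"\bLANG=[^,\s.]+\.([^,\s@]+)")


def charset_of(locale_line: str) -> str:
    """Prefer an explicit charmap; else use the codeset suffix of LANG."""
    m = CHARMAP_RE.search(locale_line)
    if m:
        return m.group(1).upper()
    m = LANG_RE.search(locale_line)
    return m.group(1).upper() if m else "UNKNOWN"


def tally(lines):
    """Count charsets over all lines that start with 'Locale:'."""
    return Counter(charset_of(l) for l in lines if l.startswith("Locale:"))


# Hypothetical sample lines, not taken from real bug reports.
sample = [
    "Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)",
    "Locale: LANG=ja_JP.EUC-JP, LC_CTYPE=ja_JP.EUC-JP (charmap=EUC-JP)",
    "Locale: LANG=zh_TW.Big5",
    "Package: example",
]
counts = tally(sample)
```

Such a count would still only describe people who file bugs with a Locale: field, not the user base as a whole.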

That Wikipedia page presents stuff from 2008 as new developments, thus is
a wee bit outdated...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Being wise is hard, but wise-ass... ooh, this one I can deliver!
⠈⠳⣄



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-19 Thread Bill Allombert
On Mon, Dec 19, 2022 at 07:08:09PM +, Simon McVittie wrote:
> On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> > As of Bookworm, legacy locales are no longer officially supported.
> 
> For clarity, I think when you say "legacy locales" you mean locales
> whose character encoding is either explicitly or implicitly something
> other than UTF-8 ("legacy national encodings"), like en_US (implicitly
> ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
> (explicitly ISO-8859-15 in its name). True?
> 
> Many of the non-UTF-8 encodings are single-byte encodings in the
> ISO-8859 family, but if I understand correctly, your reasoning applies
> equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
> Also true?
> 
> Meanwhile, locales with a UTF-8 character encoding, like en_AG
> (implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
> (explicitly UTF-8), are the ones you are considering to be non-legacy.
> Also true?

Which raise the question: does the corresponding user group moved to UTF-8 ?
Judging from ,
neither Chinese nor Japanese users have overwhelmingly moved to UTF-8,
so it would be problematic to stop supporting BIG5, GB18030 and EUC-JP.

Cheers,
-- 
Bill. 

Imagine a large red swirl here. 



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-19 Thread Simon McVittie
On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> As of Bookworm, legacy locales are no longer officially supported.

For clarity, I think when you say "legacy locales" you mean locales
whose character encoding is either explicitly or implicitly something
other than UTF-8 ("legacy national encodings"), like en_US (implicitly
ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
(explicitly ISO-8859-15 in its name). True?

Many of the non-UTF-8 encodings are single-byte encodings in the
ISO-8859 family, but if I understand correctly, your reasoning applies
equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
Also true?

Meanwhile, locales with a UTF-8 character encoding, like en_AG
(implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
(explicitly UTF-8), are the ones you are considering to be non-legacy.
Also true?

I think for Policy use, this would have to say something more precise,
like "locales with a non-UTF-8 character encoding". I wouldn't want to
get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
locale :-)
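
To show what "locales with a non-UTF-8 character encoding" could mean mechanically, here is a Python sketch that filters entries in the two-column "name charmap" layout of /usr/share/i18n/SUPPORTED. The function name non_utf8_locales is hypothetical, and the sample text is a hand-written excerpt using the locales mentioned in this thread, not a copy of the real file.

```python
def non_utf8_locales(supported_text: str):
    """Return (locale, charmap) pairs whose charmap is not UTF-8."""
    result = []
    for line in supported_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore anything not in "name charmap" form
        name, charmap = fields
        if charmap.upper() != "UTF-8":
            result.append((name, charmap))
    return result


# Hand-written excerpt for illustration only.
sample = """\
en_AG UTF-8
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8
en_US ISO-8859-1
zh_CN.GB18030 GB18030
"""
legacy = non_utf8_locales(sample)
```

Note that en_AG is kept out of the result even though its name carries no ".UTF-8" suffix, while en_US is included: the charmap column, not the name, decides.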

When you say "officially supported" here, do you refer to the extent
to which they are supported by the glibc maintainers, or some other
group? Or are you describing a change request that they *should not*
be officially supported by Debian - something that is not necessarily
true yet, but in this bug you are asking for it to become true?

> * Software may assume they always run in an UTF-8 locale, and emit or
>   require UTF-8 input/output without checking.

I suspect this is already common: for example, ikiwiki is strictly
UTF-8-only and ignores locales' character sets, which is arguably a bug
right now but would become a non-bug with your proposed policy.

This is a "may" so it can't possibly make a package gain bugs. It might
make packages have fewer bugs.

> * The execution environment (usually init system or a container) must
>   default to UTF-8 encoding unless explicitly configured otherwise.

Is this already true? This seems like the sort of thing which should be
fixed in at least the major init systems and container managers before it
goes into Policy, in the interests of not making those init systems and
container managers retroactively buggy.

> * Legacy locales are no longer officially supported, and packages may
>   drop support for them and/or exclude them from their testsuites.
> * Packages may retain support for legacy locales, but related bug reports
>   (unless security related) are considered to be of wishlist severity.

Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
correctly its character encoding is officially 7-bit ASCII), or has it
been redefined to be UTF-8? Given the special status of the C locale in
defaults and standards, it might be necessary to say that it's the only
supported locale with a non-UTF-8 character encoding.

> * Filesystems may be configured to reject file names that are not valid
>   printable UTF-8 encoded Unicode.

To put this in terms of the requirements that Policy puts on packages,
is this really a should/must in disguise: packages should/must not
assume that they can successfully read/write filenames that are not valid
printable UTF-8-encoded Unicode?

This seems like a change with a wider scope: not only is it excluding
filenames in Latin-1 or whatever, it's also excluding filenames with
non-printable characters (tabs, control characters etc.), or with
the UTF-8 representation of a noncharacter like U+FDEF. Perhaps that
should be a change orthogonal to de-supporting the non-UTF-8 locales?
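
To illustrate how much wider that scope is, here is a Python sketch of a check for "valid printable UTF-8 encoded Unicode". The function name is hypothetical, and equating "printable" with Python's str.isprintable() plus an explicit noncharacter test is an assumption of this sketch, not a definition taken from the proposal.

```python
def is_printable_utf8_name(name: bytes) -> bool:
    """True if name is valid UTF-8 with only printable characters."""
    try:
        text = name.decode("utf-8")  # strict: rejects e.g. Latin-1 bytes
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        # Noncharacters: U+FDD0..U+FDEF and the last two code points
        # of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
        if not ch.isprintable():  # control characters, tabs, etc.
            return False
    return True
```

The three rejections (bad encoding, noncharacter, non-printable) are independent tests, which is one way to see why they could be separate Policy changes.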

> * Human-readable files outside of packages' private data must be encoded
>   in UTF-8.  This applies especially to files in /usr/share/doc and /etc
>   but applies to eg. executable scripts in /bin or /sbin as well.

It's not immediately obvious to me what "human-readable files" means here.
Text files? Text files in ASCII-compatible encodings? Files intended to be
read and written by standard text editors?

I assume the intention here is to make it a policy violation to ship
documentation, scripts, configuration files, etc. encoded in something
like ISO-8859-1 or EUC-JP?

Is this intended to make it a policy violation to ship documentation, etc.
encoded in UTF-16?

> * So-called BOM (U+FEFF) must not be added to plain-text output, and if
>   present, editors/viewers customarily used for editing code should not
>   hide its presence.

This seems to me like it should perhaps be out-of-scope here, and treated
as a separate change: UTF-8 is still UTF-8, whether it starts with U+FEFF
or not, and I think deprecating en_GB in favour of en_GB.UTF-8 (and so on)
is orthogonal to deprecating the use of a U+FEFF prefix on UTF-8 text.

I think "UTF-8 output" is probably a better scope for this than
"plain-text output": my understanding is that when emitting UTF-16, UCS-2
or UCS-4 it's 

Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-16 Thread Bill Allombert
On Fri, Dec 16, 2022 at 07:21:37PM +0100, Adam Borowski wrote:
> Package: debian-policy
> Version: 4.6.1.1
> Severity: wishlist
> 
> Hi!
> As of Bookworm, legacy locales are no longer officially supported.  In order
> to not break testsuites, they're mostly working if you install locales-all,
> and you may manually request their generation by editing /etc/locale.gen --
> but functionality is expected to bit rot and/or be removed in the future.

Hi Adam,

How do you define a legacy locale ?
What do you mean by "officially supported" ?  By whom ?

Cheers,
-- 
Bill. 

Imagine a large red swirl here. 



Bug#1026231: debian-policy: document droppage of support for legacy locales

2022-12-16 Thread Adam Borowski
Package: debian-policy
Version: 4.6.1.1
Severity: wishlist

Hi!
As of Bookworm, legacy locales are no longer officially supported.  In order
to not break testsuites, they're mostly working if you install locales-all,
and you may manually request their generation by editing /etc/locale.gen --
but functionality is expected to bit rot and/or be removed in the future.

Thus, what about spelling this out in Policy?

* Software may assume they always run in an UTF-8 locale, and emit or
  require UTF-8 input/output without checking.
* The execution environment (usually init system or a container) must
  default to UTF-8 encoding unless explicitly configured otherwise.
* Legacy locales are no longer officially supported, and packages may
  drop support for them and/or exclude them from their testsuites.
* Packages may retain support for legacy locales, but related bug reports
  (unless security related) are considered to be of wishlist severity.
* Filesystems may be configured to reject file names that are not valid
  printable UTF-8 encoded Unicode.
* So-called BOM (U+FEFF) must not be added to plain-text output, and if
  present, editors/viewers customarily used for editing code should not
  hide its presence.
* Human-readable files outside of packages' private data must be encoded
  in UTF-8.  This applies especially to files in /usr/share/doc and /etc
  but applies to eg. executable scripts in /bin or /sbin as well.

Rationale: it takes a non-trivial amount of code to support diverse encodings;
Unicode is a strict superset of all legacy charsets thus there's no loss of
functionality by switching to it exclusively.  In today's Unicode world, text
files of other encodings present a barrier to being read by the user.

While data received from outside the network may legitimately use legacy
locales, requiring all of stdin/stdout/stderr and filesystem data to use
UTF-8 would simplify code.  It's not like we pay more than lip service to
other encodings anymore...

While diversity in software is welcome, diversity in standards is not:
UTF-8 will not damage your pinky finger nor require Alt-F2 kill -9 to
exit; will not make your computer fail to boot or require a trip to the
data center; nor infect your K desktop with gnomeitis.  [Of course, there's
no plausible reason to use Postfix, ever!].  In other words, having multiple
phone vendors is essential but having multiple charging connectors is bad.

As for BOM, it is explicitly discouraged by the Unicode Consortium, and can
cause security vulnerabilities where scripts that pass human review act
differently than they appear to.  A script whose #!/bin/perl line is preceded
by a BOM gets executed by bash, and this is just one example.
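
A short Python sketch of why a leading BOM matters for scripts: interpreter selection depends on the file starting with the two bytes "#!", and a UTF-8 BOM puts three other bytes in front of them. The helper names and the sample script are hypothetical; what happens after the "#!" check fails depends on the calling shell and is not modelled here.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """True if the very first two bytes are '#!'."""
    return data[:2] == b"#!"


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical script contents, with and without a BOM.
script = "#!/bin/perl\nprint qq(hi\\n);\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + script
```

Both variants look the same in an editor that hides the BOM, yet only one of them starts with "#!" at the byte level.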

As for inits/containers declaring LC_CTYPE=C.UTF-8, systemd has been doing
this for a while, in sysvinit land we debated whether that's still needed
when glibc started to consider unset locale to mean C.UTF-8 rather than C
-- but then, some language compilers do not use glibc.  debootstrap doesn't
configure a default locale, and not all higher-level tools do so either,
so a system installed in a non-standard but reasonable way may lack the
setting, to the surprise of the admin.


Meow!