Bug#603914: Please drop non-UTF8 locales
Roger Leigh dixit:

> From my reading of the standards a UTF-8 C locale would be required to behave identically to the existing ASCII C locale:
>
> • will consider all byte sequences valid

I think it wouldn’t (since UTF-8 mbrtowc/wcrtomb don’t work this way, and it can’t be done with “just” the POSIX API anyway, because they aren’t allowed to not read any input byte when outputting; in MirBSD, I’ve added a sister function to mbrtowc which can do that), so not everything can be accepted in all situations.

bye,
//mirabilos
-- 
“Having a smoking section in a restaurant is like having a peeing section in a swimming pool.”
	-- Edward Burr

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#603914: Please drop non-UTF8 locales
On Sun, Jan 09, 2011 at 10:21:50PM +, Thorsten Glaser wrote:

> Roger Leigh dixit:
>
> > From my reading of the standards a UTF-8 C locale would be required to behave identically to the existing ASCII C locale:
> >
> > • will consider all byte sequences valid
>
> I think it wouldn’t (since UTF-8 mbrtowc/wcrtomb don’t work this way, and it can’t be done with “just” the POSIX API anyway, because they aren’t allowed to not read any input byte when outputting; in MirBSD, I’ve added a sister function to mbrtowc which can do that), so not everything can be accepted in all situations.

If you are using multibyte functions, then I agree these are special cases. For these to function correctly, they do require valid input. They would of course fail when run in a UTF-8 C locale. However, they should fail in an ASCII C locale as well (I should test this), given that the wide character representation is always UCS-4 on GNU/Linux and e.g. a latin1 sequence wouldn't be valid UTF-8.

I think the "all byte sequences valid" requirement applies mainly to narrow character I/O, i.e. printf/puts etc. won't alter, drop or otherwise mangle any non-7-bit-ASCII codes. That is, I think the intent was to ensure 8-bit cleanliness in a 7-bit locale. This naturally extends to UTF-8. I'm not sure that wide character support is implied here, given that it implicitly requires correct byte sequences to function where the narrow character I/O does not (all 8-bit codes are correct).

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
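The distinction drawn above, between multibyte conversion that needs valid input and narrow I/O that is 8-bit clean, can be sketched in a few lines. This is only an illustration in Python, not the C library code under discussion: strict UTF-8 decoding stands in for an mbrtowc-style conversion, and the helper names `is_valid_utf8` and `passthrough` are made up for the example.

```python
# Latin-1 encoding of "café": the final byte 0xE9 starts a UTF-8 multibyte
# sequence but has no continuation bytes, so it is not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = "caf\u00e9".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    Loosely analogous to a multibyte-to-wide conversion succeeding
    on every character of the input.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def passthrough(data: bytes) -> bytes:
    """Byte-oriented handling: nothing is decoded, so nothing is altered."""
    return bytes(data)


print(is_valid_utf8(latin1_bytes))                # conversion rejects latin1 input
print(is_valid_utf8(utf8_bytes))                  # valid UTF-8 is accepted
print(passthrough(latin1_bytes) == latin1_bytes)  # the narrow path is 8-bit clean
```

The point of the sketch is that the same bytes can be acceptable on the byte-oriented path and unacceptable on the character-oriented one, which is the split Roger describes.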
Bug#603914: Please drop non-UTF8 locales
Roger Leigh dixit:

> I think the "all byte sequences valid" requirement applies mainly to narrow character I/O, i.e. printf/puts etc. won't alter, drop or otherwise mangle any non-7-bit-ASCII codes. That is, I think the intent was to ensure 8-bit cleanliness in a 7-bit locale. This naturally extends to UTF-8. I'm not sure that wide character support is implied here, given that it implicitly requires correct byte sequences to function where the narrow character I/O does not (all 8-bit codes are correct).

I was thinking in terms of programmes doing operations on wide characters internally (for example, tr was the first one I switched to wide characters, since in MirBSD they use 16 bit, and the table-driven design continued to work; this is also where I noticed the problem).

Those are the programmes you want to be aware of: they _are_ internationalised, thus use wchar_t and multibytes and narrow I/O, or wchar_t and wide I/O, and these will benefit from the C.UTF-8 locale; others (that just run on byte strings as if they were characters) don’t see a difference between it and the classical C locale anyway.

What I mean is, we try to use C.UTF-8 in places where we want to run on text in UTF-8 but otherwise keep the normed predictable uniform behaviour of C; in places where we operate on binary data C is probably more useful.

Hum. Do I make any sense?

Goodnight,
//mirabilos
-- 
“It is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2
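To make the tr example concrete, here is a rough Python sketch of the two kinds of program described above: one that treats input as bytes, and one that works on characters internally. It is not how any real tr is implemented; `tr_bytes` and `tr_chars` are hypothetical helpers, and UTF-8 decoding stands in for the multibyte-to-wide conversion.

```python
def tr_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Byte-wise translation, as a tool would behave in the classic C locale.

    Works on arbitrary binary input, but can only map single bytes.
    """
    return data.translate(bytes.maketrans(src, dst))


def tr_chars(data: bytes, src: str, dst: str, encoding: str = "utf-8") -> bytes:
    """Character-wise translation: decode, map characters, re-encode.

    Raises UnicodeDecodeError if data is not valid in the given encoding.
    """
    text = data.decode(encoding)
    return text.translate(str.maketrans(src, dst)).encode(encoding)


binary_blob = b"\x00u\xff"             # not valid UTF-8
utf8_text = "f\u00fcr".encode("utf-8")  # "für"

print(tr_bytes(binary_blob, b"u", b"x"))   # the byte-wise tool handles binary data
print(tr_chars(utf8_text, "\u00fc", "u"))  # the character-wise tool maps a multibyte character

try:
    tr_chars(binary_blob, "u", "x")
except UnicodeDecodeError:
    print("character-wise translation rejects the binary input")
```

Under these assumptions the byte-wise variant never notices the encoding, while the character-wise variant gains the ability to handle non-ASCII characters at the cost of requiring valid input.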
Bug#603914: Please drop non-UTF8 locales
On Sat, Nov 27, 2010 at 01:23:29PM +0200, Kalle Olavi Niemitalo wrote:

> Josselin Mouette j...@debian.org writes:
>
> > I think wheezy would be a good time to finally ditch non-UTF8 locales. IIRC, we made the switch to UTF8 by default in etch (and we were already way too late in doing that), and supporting non-UTF8 stuff becomes harder and harder, at least for desktop software.
>
> In testing, the C and POSIX locales still don't use UTF-8. I don't know about unstable.

They don't. But, see #522776.

There's nothing preventing them being switched to UTF-8 from ASCII; it doesn't break any of the C, POSIX or SUS standards and some operating systems (HP-UX) already do this.

It's certainly something I'd like to see done. It solves a number of annoying encoding and localisation issues, and it gives us UTF-8 support by default from end-to-end through the entire Debian system.

Like the existing C locale, it would require hard-coding into libc, which I've tried doing but lack the detailed knowledge of glibc internals to do it properly. One issue is that the UTF-8 charmap should not be duplicated for each separate locale; currently it appears as though each locale needlessly has its own copy! We would want to share it given its size.

Regards,
Roger
Bug#603914: Please drop non-UTF8 locales
On Sun, Nov 28, 2010 at 05:21:33PM +, Thorsten Glaser wrote:

> Fun to be reading this. Me like ;-)
>
> Anyway. With my Debian hat on, the C/POSIX locales must not use UTF-8 as encoding, because otherwise all kinds of hell break loose (consider running 'tr u x' on a binary or other legacy-encoded text file, and tr is just an example).

From my reading of the standards a UTF-8 C locale would be required to behave identically to the existing ASCII C locale:

• will consider all byte sequences valid
• will use only the ASCII collation sequences (LC_COLLATE would be identical)
• LC_CTYPE would probably also be identical (SUS specifies this less strictly than LC_COLLATE), but for backward compatibility should probably remain the same.

About the only difference would be the lack of a need for the transliteration table, and the fact that nl_langinfo(CODESET) would return UTF-8. That's pretty much it.

I'd like to pursue this in the long term, but I doubt I'll have the time to commit to it for several months. If anyone else wishes to tackle it, feel free to go for it!

> There are plans for C.UTF-8 though, and I’m a bit ashamed at having slacked off there…

No worries, there's not much going to happen at this stage in the squeeze freeze. Hopefully easy to get added early in the wheezy cycle though!

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776 (the very end) and #609306 (same bug but a feature request for eglibc).

Regards,
Roger
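The two observable properties mentioned above, the value of nl_langinfo(CODESET) and the ASCII collation order, can be inspected from Python's `locale` module, which wraps the C library calls on Unix-like systems. This is a small sketch rather than a test of any particular libc: the helper names are invented, and whether a locale called `C.UTF-8` exists at all depends on the system, so the code treats it as optional.

```python
import locale


def codeset_for(locale_name: str):
    """Return nl_langinfo(CODESET) under the named LC_CTYPE locale,
    or None if that locale is not available on this system."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


def c_collation_sort(words):
    """Sort with LC_COLLATE=C, which compares ASCII strings by code value."""
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, "C")
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


# The reported codeset names vary between C libraries.
for name in ("C", "C.UTF-8"):
    print(name, "->", codeset_for(name))

# Uppercase sorts before lowercase in the C collation order.
print(c_collation_sort(["b", "A", "a", "B"]))
```

If a UTF-8 C locale behaves as described in the message, only the first output should differ between the two locales; the collation result should stay the same.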
Bug#603914: Please drop non-UTF8 locales
On 11/12/2010 06:47, Kalle Olavi Niemitalo wrote:

> Vincent Danjean vdanj...@debian.org writes:
>
> > For example, I've lots of old text data in latin1. Some of them are on non-rewritable media. Being able to see them with LC_CTYPE=fr_FR less toto.txt is very convenient.
>
> less does not convert the characters to UTF-8 for display, so you also need a latin1 terminal for that command, and typing file names in such a terminal will make them latin1 too, which is not nice. Alternatively:
>
>   iconv --from-code=ISO-8859-1 toto.txt | less

What you say is logical but I'm sure I only used LC_CTYPE=.. less ... to look at old documents. I just ran some experiments to be sure (my terminal is urxvt, with LANG set to fr_FR.UTF-8, no other LC_* variables set). lat.txt contains latin-1 text, and utf.txt contains utf-8 text.

Correct display (with ! when I found this strange):

  cat utf.txt
! cat lat.txt
  less utf.c
  LC_CTYPE=fr_FR less utf.c
  LC_CTYPE=fr_FR less lat.c

Incorrect display:

  less lat.c

Now, run in a urxvt launched with LC_CTYPE=fr_FR.

Correct display:

  cat lat.txt
  less lat.txt

Incorrect display:

  cat utf.txt
  less utf.txt
  LC_CTYPE=fr_FR.UTF-8 less utf.txt
  LC_CTYPE=fr_FR.UTF-8 less lat.txt

> > There are also lots of old web pages written in latin1 that are still used (old exercises, ...) Not being able to see them properly on a Debian system would be a pain.
>
> I don't believe you need a latin1 locale for viewing latin1 web pages. The charset will be available from iconv() in any case, or browsers may have charset converters built in.

Ok. I do not know what will stop working if legacy locales are removed (i.e. I do not know what is using locales and what is using other mechanisms).

Regards,
Vincent
Bug#603914: Please drop non-UTF8 locales
Vincent Danjean vdanj...@debian.org writes:

> For example, I've lots of old text data in latin1. Some of them are on non-rewritable media. Being able to see them with LC_CTYPE=fr_FR less toto.txt is very convenient.

less does not convert the characters to UTF-8 for display, so you also need a latin1 terminal for that command, and typing file names in such a terminal will make them latin1 too, which is not nice. Alternatively:

  iconv --from-code=ISO-8859-1 toto.txt | less

> There are also lots of old web pages written in latin1 that are still used (old exercises, ...) Not being able to see them properly on a Debian system would be a pain.

I don't believe you need a latin1 locale for viewing latin1 web pages. The charset will be available from iconv() in any case, or browsers may have charset converters built in. Iceweasel 3.5.15 here can display Shift_JIS web pages all right even though I don't have any such locale installed. http://www.toei-anim.co.jp/tv/dejimon/ for example.
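The iconv suggestion above amounts to converting the bytes explicitly instead of relying on the locale. A minimal Python sketch of the same idea follows; `recode` is a made-up helper name, and Python's built-in codecs are used here in place of iconv(), so this shows the principle rather than the exact tool.

```python
def recode(data: bytes, from_code: str, to_code: str = "utf-8") -> bytes:
    """Convert bytes from one character encoding to another.

    No locale is consulted: the source encoding is stated explicitly,
    much as with iconv --from-code.
    """
    return data.decode(from_code).encode(to_code)


latin1_doc = b"caf\xe9\n"  # "café" followed by a newline, encoded in ISO-8859-1
print(recode(latin1_doc, "iso-8859-1"))

# A Shift_JIS round trip also works without any Japanese locale installed.
sjis_doc = "\u65e5\u672c".encode("shift_jis")
print(recode(sjis_doc, "shift_jis").decode("utf-8") == "\u65e5\u672c")
```

Because the conversion takes the source encoding as a parameter, it behaves the same regardless of which locales are generated on the machine, which is the property Kalle points out for browsers.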
Bug#603914: Please drop non-UTF8 locales
On 04/12/2010 00:15, Aurelien Jarno wrote:

> Hi,
>
> On Thu, Nov 18, 2010 at 02:07:37PM +0100, Josselin Mouette wrote:
>
> > Package: locales
> > Version: 2.11.2-7
> > Severity: wishlist
> >
> > Hi,
> >
> > I think wheezy would be a good time to finally ditch non-UTF8 locales. IIRC, we made the switch to UTF8 by default in etch (and we were already way too late in doing that), and supporting non-UTF8 stuff becomes harder and harder, at least for desktop software.
> >
> > I think we should make it clear that legacy locales are not supported anymore. Maybe by dropping them entirely, maybe by just not proposing them by default.

Does dropping them really give huge benefits?

> About the technical part, I would go for not proposing them by default in order to not break existing installations.

I fully agree with not proposing them by default on new installations. However, there are still *lots* of text documents written in legacy encodings.

For example, I've lots of old text data in latin1. Some of them are on non-rewritable media. Being able to see them with LC_CTYPE=fr_FR less toto.txt is very convenient.

There are also lots of old web pages written in latin1 that are still used (old exercises, ...). Not being able to see them properly on a Debian system would be a pain.

And I work with an editor that still writes its LaTeX documents in latin1 (in order not to break older documents and to be able to merge old and new ones).

So, I really think it is not time to totally drop non-UTF8 locales yet. Discouraging their use is a good thing, however.

Regards,
Vincent

-- 
Vincent Danjean       GPG key ID 0x9D025E87         vdanj...@debian.org
GPG key fingerprint: FC95 08A6 854D DB48 4B9A  8A94 0BF7 7867 9D02 5E87
Unofficial packages: http://moais.imag.fr/membres/vincent.danjean/deb.html
APT repo:  deb http://people.debian.org/~vdanjean/debian unstable main
Bug#603914: Please drop non-UTF8 locales
Hi,

On Thu, Nov 18, 2010 at 02:07:37PM +0100, Josselin Mouette wrote:

> Package: locales
> Version: 2.11.2-7
> Severity: wishlist
>
> Hi,
>
> I think wheezy would be a good time to finally ditch non-UTF8 locales. IIRC, we made the switch to UTF8 by default in etch (and we were already way too late in doing that), and supporting non-UTF8 stuff becomes harder and harder, at least for desktop software.
>
> I think we should make it clear that legacy locales are not supported anymore. Maybe by dropping them entirely, maybe by just not proposing them by default.

While I fully support this idea, I am sure if such a thing is decided and implemented by the glibc maintainers, we are going to spend a lot of time arguing and closing bug reports. It's always the same when you try to reduce the possible choices of users (see recent tzdata bug reports for example).

That's why I think it should be decided by a more respectable team, such as the release team or even the tech ctte. What do you think would be the best?

About the technical part, I would go for not proposing them by default in order to not break existing installations.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurel...@aurel32.net                 http://www.aurel32.net
Bug#603914: Please drop non-UTF8 locales
Hi!

Fun to be reading this. Me like ;-)

Anyway. With my Debian hat on, the C/POSIX locales must not use UTF-8 as encoding, because otherwise all kinds of hell break loose (consider running 'tr u x' on a binary or other legacy-encoded text file, and tr is just an example). There are plans for C.UTF-8 though, and I’m a bit ashamed at having slacked off there…

In MirBSD, I added a ‘-l’ command line option to script(1) to do the encoding for latin1-based terminals. This might just do the trick. If desired, I’ll prepare a patch against Debian’s.

bye,
//mirabilos
-- 
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but what about xfs, and if only i had waited until reiser4 was ready... in the beginning, there was ffs, and in the middle, there was ffs, and at the end, there was still ffs, and the sys admins knew it was good. :)
	-- Ted Unangst über *fs
Bug#603914: Please drop non-UTF8 locales
Josselin Mouette j...@debian.org writes:

> I think wheezy would be a good time to finally ditch non-UTF8 locales. IIRC, we made the switch to UTF8 by default in etch (and we were already way too late in doing that), and supporting non-UTF8 stuff becomes harder and harder, at least for desktop software.

In testing, the C and POSIX locales still don't use UTF-8. I don't know about unstable.

My latest use of LANG=fi_FI.ISO-8859-1 was when I had a VT420 connected to a serial port (largely for diagnosing graphics driver problems). No UTF-8 support there. However, even then, it would have been better to use LANG=fi_FI.UTF-8 and recode only the terminal I/O with some wrapper, so that file names would have been consistently UTF-8. The luit program cannot be used for that because it supports only UTF-8 terminals in legacy locales, rather than vice versa. If I remember correctly, tmux doesn't support such conversions either, but screen does.
Bug#603914: Please drop non-UTF8 locales
Package: locales
Version: 2.11.2-7
Severity: wishlist

Hi,

I think wheezy would be a good time to finally ditch non-UTF8 locales. IIRC, we made the switch to UTF8 by default in etch (and we were already way too late in doing that), and supporting non-UTF8 stuff becomes harder and harder, at least for desktop software.

I think we should make it clear that legacy locales are not supported anymore. Maybe by dropping them entirely, maybe by just not proposing them by default.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/2 CPU cores)
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages locales depends on:
ii  debconf [debconf-2.0]   1.5.36     Debian configuration management sy
ii  libc6 [glibc-2.11-1]    2.11.2-7   Embedded GNU C Library: Shared lib

locales recommends no packages.

locales suggests no packages.

-- debconf information excluded