Bug#603914: Please drop non-UTF8 locales

2011-01-09 Thread Thorsten Glaser
Roger Leigh dixit:

> From my reading of the standards a UTF-8 C locale would be required
> to behave identically to the existing ASCII C locale:
>
> • will consider all byte sequences valid

I think it wouldn’t: UTF-8 mbrtowc/wcrtomb don’t work
this way, and it can’t be done with “just” the POSIX API
anyway, because they aren’t allowed to not read any input
byte when outputting (in MirBSD, I’ve added a sister
function to mbrtowc which can do that). So not everything
can be accepted in all situations.

bye,
//mirabilos
-- 
  “Having a smoking section in a restaurant is like having
  a peeing section in a swimming pool.”
-- Edward Burr



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#603914: Please drop non-UTF8 locales

2011-01-09 Thread Roger Leigh
On Sun, Jan 09, 2011 at 10:21:50PM +, Thorsten Glaser wrote:
> Roger Leigh dixit:
>
>> From my reading of the standards a UTF-8 C locale would be required
>> to behave identically to the existing ASCII C locale:
>>
>> • will consider all byte sequences valid
>
> I think it wouldn’t: UTF-8 mbrtowc/wcrtomb don’t work
> this way, and it can’t be done with “just” the POSIX API
> anyway, because they aren’t allowed to not read any input
> byte when outputting (in MirBSD, I’ve added a sister
> function to mbrtowc which can do that). So not everything
> can be accepted in all situations.

If you are using multibyte functions, then I agree these are special
cases.  For these to function correctly, they do require valid input.
They would of course fail on invalid input when run in a UTF-8 C
locale.  However, they should fail in an ASCII C locale as well (I
should test this), given that the wide character representation is
always UCS-4 on GNU/Linux and, for example, a latin1 sequence wouldn't
be valid UTF-8.

I think the "all byte sequences valid" requirement applies mainly to
narrow character I/O, i.e. printf/puts etc. won't alter, drop or
otherwise mangle any non-7-bit-ASCII codes.  I think the intent was to
ensure 8-bit cleanliness in a 7-bit locale.  This naturally extends
to UTF-8.  I'm not sure that wide character support is implied here,
given that it implicitly requires correct byte sequences to function
where the narrow character I/O does not (all 8-bit codes are correct).
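
The distinction can be sketched in Python, used here only as a stand-in
for the C library calls under discussion: copying bytes models narrow
I/O, and a strict UTF-8 decode models what a multibyte conversion such
as mbrtowc has to do.  The function names are made up for this
illustration and are not part of any libc.

```python
def narrow_copy(data: bytes) -> bytes:
    """Model narrow-character I/O: bytes pass through without interpretation."""
    return bytes(data)


def is_valid_utf8(data: bytes) -> bool:
    """Model a multibyte-to-wide conversion: it needs a well-formed sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


latin1_text = "café".encode("latin-1")   # b'caf\xe9', not valid UTF-8
utf8_text = "café".encode("utf-8")       # b'caf\xc3\xa9'

# Narrow I/O is 8-bit clean for both inputs.
assert narrow_copy(latin1_text) == latin1_text
assert narrow_copy(utf8_text) == utf8_text

# Only the UTF-8 input survives conversion to characters.
print(is_valid_utf8(utf8_text), is_valid_utf8(latin1_text))
```

Whether glibc's mbrtowc really rejects such input in the plain C locale
is exactly the point that still needs testing; the sketch only shows the
UTF-8 side.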


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#603914: Please drop non-UTF8 locales

2011-01-09 Thread Thorsten Glaser
Roger Leigh dixit:

> I think the "all byte sequences valid" requirement applies mainly to
> narrow character I/O, i.e. printf/puts etc. won't alter, drop or
> otherwise mangle any non-7-bit-ASCII codes.  I think the intent was to
> ensure 8-bit cleanliness in a 7-bit locale.  This naturally extends
> to UTF-8.  I'm not sure that wide character support is implied here,
> given that it implicitly requires correct byte sequences to function
> where the narrow character I/O does not (all 8-bit codes are correct).

I was thinking in terms of programmes operating on wide characters
internally (for example, tr was the first one I switched to wide
characters, since in MirBSD they use 16 bit, and the table-driven design
continued to work; this is also where I noticed the problem). Those are
the programmes you want to be aware of: they _are_ internationalised, thus
use wchar_t and multibytes and narrow I/O, or wchar_t and wide I/O, and
these will benefit from the C.UTF-8 locale; others (that just run on
byte strings as if they were characters) don’t see a difference between
it and the classical C locale anyway.

What I mean is, we try to use C.UTF-8 in places where we want to run
on text in UTF-8 but otherwise keep the normed, predictable, uniform
behaviour of C; in places where we operate on binary data, C is
probably more useful.
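
A rough Python sketch of the two kinds of programme described above: a
byte-oriented tr, which behaves the same in C and C.UTF-8, and a
character-oriented tr, which only makes sense on input that decodes in
the locale's encoding.  Python stands in for the wchar_t code here, and
both function names are invented for the example.

```python
def tr_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Byte-wise tr: works on any input, but only maps single bytes."""
    return data.translate(bytes.maketrans(src, dst))


def tr_chars(data: bytes, src: str, dst: str, encoding: str = "utf-8") -> bytes:
    """Character-wise tr: decode, map characters, encode again.

    Raises UnicodeDecodeError if the input is not valid in `encoding`.
    """
    text = data.decode(encoding)
    return text.translate(str.maketrans(src, dst)).encode(encoding)


binary = b"\xffu\x00u"
print(tr_bytes(binary, b"u", b"x"))              # binary data is handled

utf8 = "müsli".encode("utf-8")
print(tr_chars(utf8, "ü", "u").decode("utf-8"))  # multibyte character is mapped

try:
    tr_chars(binary, "u", "x")
except UnicodeDecodeError as exc:
    print("character-wise tr rejects binary input:", exc.reason)
```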

Hum. Do I make any sense?

Goodnight,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2






Bug#603914: Please drop non-UTF8 locales

2011-01-08 Thread Roger Leigh
On Sat, Nov 27, 2010 at 01:23:29PM +0200, Kalle Olavi Niemitalo wrote:
> Josselin Mouette j...@debian.org writes:
>
>> I think wheezy would be a good time to finally ditch non-UTF8 locales.
>> IIRC, we made the switch to UTF8 by default in etch (and we were already
>> way too late in doing that), and supporting non-UTF8 stuff becomes
>> harder and harder, at least for desktop software.
>
> In testing, the C and POSIX locales still don't use UTF-8.
> I don't know about unstable.

They don't.  But, see #522776.  There's nothing preventing them
being switched to UTF-8 from ASCII; it doesn't break any of the
C, POSIX or SUS standards and some operating systems (HP-UX)
already do this.

It's certainly something I'd like to see done.  It solves a number
of annoying encoding and localisation issues, and it gives us UTF-8
support by default from end-to-end through the entire Debian system.

Like the existing C locale, it would require hard-coding into libc,
which I've tried doing but lack the detailed knowledge of glibc
internals to do it properly.  One issue is that the UTF-8 charmap
should not be duplicated for each separate locale; currently it
appears as though each locale needlessly has its own copy!  We would
want to share it given its size.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#603914: Please drop non-UTF8 locales

2011-01-08 Thread Roger Leigh
On Sun, Nov 28, 2010 at 05:21:33PM +, Thorsten Glaser wrote:
> Fun to be reading this. Me like ;-)
>
> Anyway. With my Debian hat on, the C/POSIX locales must not use
> UTF-8 as encoding, because otherwise, all kinds of hell break
> loose (consider running 'tr u x' on a binary or other legacy
> encoded text file, and tr is just an example).

From my reading of the standards a UTF-8 C locale would be required
to behave identically to the existing ASCII C locale:

• will consider all byte sequences valid
• will use only the ASCII collation sequences (LC_COLLATE would be
  identical)
• LC_CTYPE would probably also be identical (SUS specifies this
  less strictly than LC_COLLATE), but for backward compatibility
  should probably remain the same.

About the only difference would be the lack of a need for the
transliteration table, and the fact that the nl_langinfo(CODESET)
would return UTF-8.  That's pretty much it.
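
For reference, the codeset a locale reports can be queried as in the
following Python sketch, which calls the C library's nl_langinfo through
the standard locale module.  The helper name codeset_of is made up for
the example.  On a glibc system the C locale typically reports an ASCII
name today; under the proposal it would report UTF-8.

```python
import locale


def codeset_of(locale_name: str) -> str:
    """Return nl_langinfo(CODESET) for the given LC_CTYPE locale."""
    previous = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    finally:
        # Restore whatever LC_CTYPE was in effect before the query.
        locale.setlocale(locale.LC_CTYPE, previous)


print(codeset_of("C"))
```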

I'd like to pursue this in the long term, but I doubt I'll have the
time to commit to it for several months.  If anyone else wishes to
tackle it, feel free to go for it!

> There are plans
> for C.UTF-8 though, and I’m a bit ashamed at having slacked off
> there…

No worries, there's not much going to happen at this stage in the
squeeze freeze.  Hopefully easy to get added early in the wheezy
cycle though!

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776 (the very end)
and #609306 (same bug but a feature request for eglibc).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#603914: Please drop non-UTF8 locales

2010-12-14 Thread Vincent Danjean
On 11/12/2010 06:47, Kalle Olavi Niemitalo wrote:
> Vincent Danjean vdanj...@debian.org writes:
>
>>   For example, I've lots of old text data in latin1. Some of them are on
>> non-rewritable media. Being able to see them with
>> LC_CTYPE=fr_FR less toto.txt is very convenient.
>
> less does not convert the characters to UTF-8 for display, so you
> also need a latin1 terminal for that command, and typing file names
> in such a terminal will make them latin1 too, which is not nice.
> Alternatively: iconv --from-code=ISO-8859-1 toto.txt | less

What you say is logical, but I'm sure I only used LC_CTYPE=.. less ...
to look at old documents. I just ran some experiments to be sure (my terminal
is urxvt, with LANG set to fr_FR.UTF-8, no other LC_* variables set).
lat.txt contains latin-1 text, and utf.txt contains utf-8 text.

Correct display (with ! where I found this strange):
  cat utf.txt
! cat lat.txt
  less utf.txt
  LC_CTYPE=fr_FR less utf.txt
  LC_CTYPE=fr_FR less lat.txt
Incorrect display:
  less lat.txt

Now, run in a urxvt launched with LC_CTYPE=fr_FR
Correct display:
  cat lat.txt
  less lat.txt
Incorrect display:
  cat utf.txt
  less utf.txt
  LC_CTYPE=fr_FR.UTF-8 less utf.txt
  LC_CTYPE=fr_FR.UTF-8 less lat.txt

>>   There are also lots of old web pages written in latin1 that are still
>> used (old exercises, ...) Not being able to see them properly on a
>> Debian system would be a pain.
>
> I don't believe you need a latin1 locale for viewing latin1 web
> pages.  The charset will be available from iconv() in any case,
> or browsers may have charset converters built in.

OK. I do not know what will stop working if legacy locales are
removed (i.e. I do not know what is using locales and what is using
other mechanisms).

  Regards,
Vincent






Bug#603914: Please drop non-UTF8 locales

2010-12-10 Thread Kalle Olavi Niemitalo
Vincent Danjean vdanj...@debian.org writes:

>   For example, I've lots of old text data in latin1. Some of them are on
> non-rewritable media. Being able to see them with
> LC_CTYPE=fr_FR less toto.txt is very convenient.

less does not convert the characters to UTF-8 for display, so you
also need a latin1 terminal for that command, and typing file names
in such a terminal will make them latin1 too, which is not nice.
Alternatively: iconv --from-code=ISO-8859-1 toto.txt | less
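
The same recoding can be sketched in Python, assuming the target
encoding is UTF-8 (iconv without --to-code uses the current locale's
encoding).  The function name latin1_to_utf8 is only illustrative.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Recode ISO-8859-1 bytes to UTF-8, like iconv --from-code=ISO-8859-1."""
    # Every byte value is a valid ISO-8859-1 character, so decoding cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")


# "déjà vu" as stored in a latin1 file.
print(latin1_to_utf8(b"d\xe9j\xe0 vu"))
```

Reading toto.txt in binary mode and writing the result to standard
output would give a pager the same UTF-8 stream as the iconv pipeline.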

>   There are also lots of old web pages written in latin1 that are still
> used (old exercises, ...) Not being able to see them properly on a
> Debian system would be a pain.

I don't believe you need a latin1 locale for viewing latin1 web
pages.  The charset will be available from iconv() in any case,
or browsers may have charset converters built in.
Iceweasel 3.5.15 here can display Shift_JIS web pages
all right even though I don't have any such locale installed.
http://www.toei-anim.co.jp/tv/dejimon/ for example.




Bug#603914: Please drop non-UTF8 locales

2010-12-05 Thread Vincent Danjean
On 04/12/2010 00:15, Aurelien Jarno wrote:
 
> Hi,
>
> On Thu, Nov 18, 2010 at 02:07:37PM +0100, Josselin Mouette wrote:
>> Package: locales
>> Version: 2.11.2-7
>> Severity: wishlist
>>
>> Hi,
>>
>> I think wheezy would be a good time to finally ditch non-UTF8 locales.
>> IIRC, we made the switch to UTF8 by default in etch (and we were already
>> way too late in doing that), and supporting non-UTF8 stuff becomes
>> harder and harder, at least for desktop software.
>>
>> I think we should make it clear that legacy locales are not supported
>> anymore. Maybe by dropping them entirely, maybe by just not proposing
>> them by default.

Does dropping them really give huge benefits?

> About the technical part, I would go to not proposing them by default in
> order to not break existing installations.

I fully agree with not proposing them by default on new installations.
However, there are still *lots* of text documents written in legacy encodings.
  For example, I have lots of old text data in latin1. Some of it is on
non-rewritable media. Being able to see it with
LC_CTYPE=fr_FR less toto.txt is very convenient.
  There are also lots of old web pages written in latin1 that are still
used (old exercises, ...). Not being able to see them properly on a
Debian system would be a pain.
And I work with an editor that still writes its LaTeX documents in latin1
(in order not to break older documents and to be able to merge old and
new ones).
  So, I really think it is not yet time to totally drop non-UTF8
locales. Discouraging their use is a good thing, however.

  Regards,
Vincent
-- 
Vincent Danjean   GPG key ID 0x9D025E87 vdanj...@debian.org
GPG key fingerprint: FC95 08A6 854D DB48 4B9A  8A94 0BF7 7867 9D02 5E87
Unofficial packages: http://moais.imag.fr/membres/vincent.danjean/deb.html
APT repo:  deb http://people.debian.org/~vdanjean/debian unstable main







Bug#603914: Please drop non-UTF8 locales

2010-12-03 Thread Aurelien Jarno

Hi,

On Thu, Nov 18, 2010 at 02:07:37PM +0100, Josselin Mouette wrote:
> Package: locales
> Version: 2.11.2-7
> Severity: wishlist
>
> Hi,
>
> I think wheezy would be a good time to finally ditch non-UTF8 locales.
> IIRC, we made the switch to UTF8 by default in etch (and we were already
> way too late in doing that), and supporting non-UTF8 stuff becomes
> harder and harder, at least for desktop software.
>
> I think we should make it clear that legacy locales are not supported
> anymore. Maybe by dropping them entirely, maybe by just not proposing
> them by default.
 

While I fully support this idea, I am sure if such a thing is decided
and implemented by the glibc maintainers, we are going to spend a lot of
time arguing and closing bug reports. It's always the same when you try
to reduce the possible choices of users (see recent tzdata bug reports
for example).

That's why I think it should be decided by a more respectable team, such
as the release team or even the tech ctte. What do you think would be
the best?

As for the technical part, I would go for not proposing them by default,
in order not to break existing installations.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net






Bug#603914: Please drop non-UTF8 locales

2010-11-28 Thread Thorsten Glaser

Hi!

Fun to be reading this. Me like ;-)

Anyway. With my Debian hat on, the C/POSIX locales must not use
UTF-8 as encoding, because otherwise, all kinds of hell break
loose (consider running 'tr u x' on a binary or other legacy
encoded text file, and tr is just an example). There are plans
for C.UTF-8 though, and I’m a bit ashamed at having slacked off
there…

In MirBSD, I added a ‘-l’ command line option to script(1) to
do the encoding for latin1-based terminals. This might just do
the trick. If desired, I’ll prepare a patch against Debian’s.

bye,
//mirabilos
-- 
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but
what about xfs, and if only i had waited until reiser4 was ready... in the be-
ginning, there was ffs, and in the middle, there was ffs, and at the end, there
was still ffs, and the sys admins knew it was good. :)  -- Ted Unangst über *fs






Bug#603914: Please drop non-UTF8 locales

2010-11-27 Thread Kalle Olavi Niemitalo
Josselin Mouette j...@debian.org writes:

> I think wheezy would be a good time to finally ditch non-UTF8 locales.
> IIRC, we made the switch to UTF8 by default in etch (and we were already
> way too late in doing that), and supporting non-UTF8 stuff becomes
> harder and harder, at least for desktop software.

In testing, the C and POSIX locales still don't use UTF-8.
I don't know about unstable.

My latest use of LANG=fi_FI.ISO-8859-1 was when I had a VT420
connected to a serial port (largely for diagnosing graphics
driver problems).  No UTF-8 support there.  However, even then,
it would have been better to use LANG=fi_FI.UTF-8 and recode
only the terminal I/O with some wrapper, so that file names
would have been consistently UTF-8.  The luit program cannot
be used for that because it supports only UTF-8 terminals in
legacy locales, rather than vice versa.  If I remember correctly,
tmux doesn't support such conversions either, but screen does.
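
The output half of such a wrapper could look roughly like the Python
sketch below.  It is not how screen or luit do it; recode_stream and its
parameters are invented for the illustration.  The point it shows is
that the recoding has to be incremental, because a multibyte UTF-8
sequence can be split across two reads from the program.

```python
import codecs
from typing import Iterable, Iterator


def recode_stream(chunks: Iterable[bytes],
                  source: str = "utf-8",
                  target: str = "iso-8859-1") -> Iterator[bytes]:
    """Recode a stream chunk by chunk, coping with split multibyte sequences."""
    decoder = codecs.getincrementaldecoder(source)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            # Characters the terminal cannot show become '?'.
            yield text.encode(target, errors="replace")
    # Flush anything still buffered (e.g. a truncated sequence at the end).
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode(target, errors="replace")


# "é" (0xC3 0xA9 in UTF-8) arrives split across two reads,
# followed by a euro sign, which latin1 cannot represent.
output = b"".join(recode_stream([b"caf\xc3", b"\xa9 \xe2\x82\xac"]))
print(output)
```

Keyboard input would need the reverse conversion, from the terminal's
encoding to UTF-8, which is omitted here.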




Bug#603914: Please drop non-UTF8 locales

2010-11-18 Thread Josselin Mouette
Package: locales
Version: 2.11.2-7
Severity: wishlist

Hi,

I think wheezy would be a good time to finally ditch non-UTF8 locales. 
IIRC, we made the switch to UTF8 by default in etch (and we were already 
way too late in doing that), and supporting non-UTF8 stuff becomes 
harder and harder, at least for desktop software.

I think we should make it clear that legacy locales are not supported 
anymore. Maybe by dropping them entirely, maybe by just not proposing 
them by default.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/2 CPU cores)
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages locales depends on:
ii  debconf [debconf-2.0] 1.5.36 Debian configuration management sy
ii  libc6 [glibc-2.11-1]  2.11.2-7   Embedded GNU C Library: Shared lib

locales recommends no packages.

locales suggests no packages.

-- debconf information excluded


