Re: Status of UTF-8 Debian changelogs
On Sun, 08 Jun 2003, Wouter Verhelst wrote:
> [EMAIL PROTECTED]:~$ echo $LANG
> nl_BE.UTF-8

Is it in locale.gen? Otherwise, you will NOT have the locale
information...

> which means that uxterm manually ensures that $LANG is set to
> something.UTF-8, since I set my $LANG to nl_BE.

Ick.

--
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh
Re: Status of UTF-8 Debian changelogs
On Mon, 2003-06-09 at 12:05, Henrique de Moraes Holschuh wrote:
> On Sun, 08 Jun 2003, Wouter Verhelst wrote:
> > [EMAIL PROTECTED]:~$ echo $LANG
> > nl_BE.UTF-8
>
> Is it in locale.gen? Otherwise, you will NOT have the locale
> information...

Ah, good call. We should have that in the default locale.gen.
Re: Status of UTF-8 Debian changelogs
Hi Colin!

On Mon, 09 Jun 2003, Colin Walters wrote:
> On Mon, 2003-06-09 at 12:05, Henrique de Moraes Holschuh wrote:
> > On Sun, 08 Jun 2003, Wouter Verhelst wrote:
> > > [EMAIL PROTECTED]:~$ echo $LANG
> > > nl_BE.UTF-8
> >
> > Is it in locale.gen? Otherwise, you will NOT have the locale
> > information...
>
> Ah, good call. We should have that in the default locale.gen.

You'd need to add UTF8 locales for every locale, then. And they're
often unsupported. I know for a fact pt_BR.UTF8 is unsupported (even if
localegen claims it managed to generate it).

--
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh
Re: Status of UTF-8 Debian changelogs
On Thu, Jun 05, 2003 at 08:57:06PM -0400, Colin Walters wrote:
JR> the only thing that will change is that if someone complains at
JR> people who use UTF-8 in changelogs, a new retort will be
JR> available, THE POLICY MADE ME DO IT!!1!, or similar.

CW> Why would someone complain?

I would complain. I am using KOI8-R terminal which can not display
Latin-1 characters, and it seems backward to me to mandate or even
allow _usage_ of UTF-8 ahead of getting it _supported_ across the
system.

I'd rather have 7-bit ASCII changelogs: why Latin-1 users are
privileged to use native spelling of their names, while Cyrillic and
Kanji and other users have to resort to transliteration?

--
Dmitry Borodaenko
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 04:59:29PM +0300, Dmitry Borodaenko wrote:
> On Thu, Jun 05, 2003 at 08:57:06PM -0400, Colin Walters wrote:
> JR> the only thing that will change is that if someone complains at
> JR> people who use UTF-8 in changelogs, a new retort will be
> JR> available, THE POLICY MADE ME DO IT!!1!, or similar.
>
> CW> Why would someone complain?
>
> I would complain. I am using KOI8-R terminal which can not display
> Latin-1 characters,

Where did Latin-1 come into this?

> and it seems backward to me to mandate or even allow _usage_ of UTF-8
> ahead of getting it _supported_ across the system.

If you find yourself with a UTF-8 file, use a program which knows how
to recode on the fly to your native encoding. Such programs are
increasingly common.

What do you lose here? Those who have fonts that can display the
character in question will be able to do so; those who don't won't, but
will see some reasonably obvious indicator like a ? or a filled-in
square to show that the character is one they can't display. This is
superior to the situation where those who don't have such fonts just
see some gibberish.

> I'd rather have 7-bit ASCII changelogs: why Latin-1 users are
> privileged to use native spelling of their names, while Cyrillic and
> Kanji and other users have to resort to transliteration?

They aren't so privileged. They may decide to do it anyway, but since
the encoding of changelogs is not yet specified you currently take pot
luck on anything outside 7-bit ASCII.

I believe you've just contradicted yourself, anyway. Nobody wants to
have to transliterate their name. I don't want to have to transliterate
the names of people who help me with my packages when I credit them in
the changelog; in some cases I may not even know how to transliterate
their names correctly. UTF-8 allows me to spell their names correctly.
At worst, a couple of characters may not be displayed properly for
people using legacy encodings who don't have software that can recode
for them, but if I'd artificially transliterated to 7-bit ASCII then
nobody would get to see the correct spellings anyway. Since UTF-8
includes ASCII, all the technical content of my changelogs will still
appear normally no matter what locale you're using, but suddenly it
becomes possible for me to credit my contributors properly regardless
of whether they come from Spain, Russia, or Japan.

We're not talking about mandating the use of UTF-8 across the whole
system here. We're talking about recommending its use in one particular
case where it gives a small but real benefit, and where the
consequences of getting it wrong are not very important (we can always
go back and recode a few changelogs if some unforeseen badness
results). Think of it as a safe experiment in advance of wider
deployment of UTF-8 later on.

Package maintainers who aren't set up for writing UTF-8 can always
resort to transliteration into ASCII if need be.

--
Colin Watson [EMAIL PROTECTED]
Re: Status of UTF-8 Debian changelogs
On Sat, 2003-06-07 at 09:59, Dmitry Borodaenko wrote:
> I am using KOI8-R terminal which can not display Latin-1 characters,
> and it seems backward to me to mandate or even allow _usage_ of UTF-8
> ahead of getting it _supported_ across the system.

A growing amount of software in Debian has UTF-8 support. I have been
using a fully UTF-8 locale for some time. At least gnome-terminal has
excellent support for UTF-8; xterm has support too if you invoke it as
'uxterm'. So I think the support is already here.

> I'd rather have 7-bit ASCII changelogs: why Latin-1 users are
> privileged to use native spelling of their names, while Cyrillic and
> Kanji and other users have to resort to transliteration?

I think after we switch to UTF-8, there's no reason why you should.
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 04:21:33PM +0100, Colin Watson wrote:
DB> I am using KOI8-R terminal which can not display Latin-1
DB> characters,

CW> Where did Latin-1 come into this?

I said characters, not encoding, and I mean that KOI8-R character set
does not include characters from Latin-1. Therefore, these characters
need to be replaced with '?', as you point out below.

CW> What do you lose here? Those who have fonts that can display the
CW> character in question will be able to do so; those who don't won't,
CW> but will see some reasonably obvious indicator like a ? or a
CW> filled-in square to show that the character is one they can't
CW> display. This is superior to the situation where those who don't
CW> have such fonts just see some gibberish.

...

I don't see it as a proper credit to your contributors if their name
appears as 'J?rg?n' (or even '' in case of Kanji) on my display. Were
it transliterated, I would at least be able to pronounce it (and there
are standard rules for such transliteration anyway (I even think iconv
should have an option to do lossy transliteration for characters
outside of target character set)).

DB> I'd rather have 7-bit ASCII changelogs: why Latin-1 users are
DB> privileged to use native spelling of their names, while Cyrillic
DB> and Kanji and other users have to resort to transliteration?

CW> They aren't so privileged. They may decide to do it anyway, but
CW> since the encoding of changelogs is not yet specified you currently
CW> take pot luck on anything outside 7-bit ASCII.

What I objected to is that they may: I'd rather they may not. I'd
rather encoding of changelogs was specified to be 7-bit ASCII.

CW> I believe you've just contradicted yourself, anyway. Nobody wants
CW> to have to transliterate their name.

Excuse me for ad hominem, but how many foreign languages do you speak?
The reason I'm asking is that my observation is that people from
countries with completely non-ASCII writing system (as opposed to
European Latin-based languages) almost always do transliterate their
names when they communicate with someone speaking a different language.
Do you observe a different pattern?

You see, it is not only a technical issue, it is a communication issue.
If you can't read Cyrillic, native spelling of my name wouldn't help
you to read it, even if it is displayed correctly.

...

CW> Package maintainers who aren't set up for writing UTF-8 can always
CW> resort to transliteration into ASCII if need be.

The biggest compromise you can convince me to with that argument, is to
allow to put non-ASCII names in UTF-8 into changelogs, but only if such
name is accompanied by ASCII transliteration. But that solution is
substantially more complex than just limiting changelogs to 7-bit
ASCII, and there is no easy way to check for compliance.

--
Dmitry Borodaenko
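[Editorial sketch] The "lossy transliteration" idea raised above can be illustrated roughly as follows. This is only a sketch in Python under stated assumptions: the helper name `to_ascii_lossy` is hypothetical, and the approach is plain accent stripping via Unicode decomposition, not a real transliteration scheme. Scripts such as Cyrillic or Kanji would need a mapping table that this sketch does not include, so they still degrade to '?'.

```python
import unicodedata


def to_ascii_lossy(name: str) -> str:
    """Strip combining marks where possible; replace the rest with '?'."""
    # NFKD splits "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (e.g. Cyrillic) becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


latin_name = to_ascii_lossy("Jürgen")       # accents can be stripped
cyrillic_name = to_ascii_lossy("Дмитрий")   # no ASCII base letters exist
```

Under these assumptions a Latin name with diacritics stays pronounceable, while a Cyrillic name is no better off than with a terminal that prints '?', which is roughly the asymmetry being argued about in this message.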
Re: Status of UTF-8 Debian changelogs
On Sat, 2003-06-07 at 13:43, Dmitry Borodaenko wrote:
> I don't see it as a proper credit to your contributors if their name
> appears as 'J?rg?n' (or even '' in case of Kanji) on my display.

That's a problem with your display.

> What I objected to is that they may: I'd rather they may not. I'd
> rather encoding of changelogs was specified to be 7-bit ASCII.

I think that's just like giving up. It will make life more painful for
everyone.

> Excuse me for ad hominem, but how many foreign languages do you speak?
> The reason I'm asking is that my observation is that people from
> countries with completely non-ASCII writing system (as opposed to
> European Latin-based languages) almost always do transliterate their
> names when they communicate with someone speaking a different
> language.

Of course, this is likely because it wasn't until fairly recently (i.e.
the last year or two) that GNU/Linux got some basic support for their
writing systems. So they essentially had to transliterate. But now with
UTF-8 there's a better choice, and they can use their real name.

> The biggest compromise you can convince me to with that argument, is
> to allow to put non-ASCII names in UTF-8 into changelogs, but only if
> such name is accompanied by ASCII transliteration. But that solution
> is substantially more complex than just limiting changelogs to 7-bit
> ASCII, and there is no easy way to check for compliance.

That's something that an individual maintainer could decide to do.
Perhaps they could include a transliteration in quotation marks, like:

  カゼチ "Junichiro Koizumi" [EMAIL PROTECTED].

My apologies if the above is some grave insult in Japanese; I just
picked some random Katakana in gucharmap :)

Anyways, I think transliteration is largely a separate issue from the
encoding of the changelog. Using UTF-8 doesn't force people to stop
transliterating.
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 04:21:33PM +0100, Colin Watson wrote:
> What do you lose here? Those who have fonts that can display the
> character in question will be able to do so; those who don't won't,
> but will see some reasonably obvious indicator like a ? or a
> filled-in square to show that the character is one they can't
> display. This is superior to the situation where those who don't have
> such fonts just see some gibberish.

Superior? No way, it's just as bad. Whether the noise is gibberish, or
whether it consists of question marks or cute little squares doesn't
make any difference at all.

--
Wouter Verhelst
Debian GNU/Linux -- http://www.debian.org
Nederlandstalige Linux-documentatie -- http://nl.linux.org
An expert can usually spot the difference between a fake charge and a
full one, but there are plenty of dead experts.
  -- National Geographic Channel, in a documentary about large African
     beasts.
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 09:31:26PM +0200, Wouter Verhelst wrote:
> On Sat, Jun 07, 2003 at 04:21:33PM +0100, Colin Watson wrote:
> > What do you lose here? Those who have fonts that can display the
> > character in question will be able to do so; those who don't won't,
> > but will see some reasonably obvious indicator like a ? or a
> > filled-in square to show that the character is one they can't
> > display. This is superior to the situation where those who don't
> > have such fonts just see some gibberish.
>
> Superior? No way, it's just as bad. Whether the noise is gibberish,
> or whether it consists of question marks or cute little squares
> doesn't make any difference at all.

Except that UTF8 is non-destructive when interpreted as any other
character set. The same cannot be said of many other character sets:
trying to display some Western charsets on some CJK terminals can cause
codepage shifts that corrupt the display of the remainder of the text,
IIRC.

--
Steve Langasek
postmodern programmer
Re: Status of UTF-8 Debian changelogs
On Sat, 2003-06-07 at 15:36, Wouter Verhelst wrote:
> Yeah, but it's not always as good as the legacy support is. For
> instance, last I tried uxterm (like, 2 minutes ago), I put in a euro
> sign somewhere. Which appeared correctly (hurray), but doing
> backspace over that didn't do what it was supposed to do, in that
> only one of the three unicode bytes was removed (bug not filed yet,
> will do if I don't forget, and find the time to investigate
> properly).

Are you using zsh? I get that kind of behavior with it, but bash works
ok. This is unfortunate because I really like zsh otherwise :/
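[Editorial sketch] The "one of the three bytes" symptom quoted above is consistent with a line editor that deletes bytes rather than characters. The Python below only illustrates that distinction; it is not a diagnosis of the actual uxterm, bash, or zsh behaviour, and both helper names are hypothetical.

```python
euro = "\u20ac"                  # EURO SIGN
encoded = euro.encode("utf-8")   # three bytes in UTF-8: E2 82 AC


def drop_last_byte(data: bytes) -> bytes:
    """Byte-wise backspace: removes one byte, not one character."""
    return data[:-1]


def drop_last_char(data: bytes) -> bytes:
    """Character-aware backspace: decode, drop one character, re-encode."""
    return data.decode("utf-8")[:-1].encode("utf-8")


line = b"5 " + encoded
after_byte_backspace = drop_last_byte(line)   # leaves a truncated sequence
after_char_backspace = drop_last_char(line)   # leaves just b"5 "
```

The byte-wise result ends in an incomplete multibyte sequence, which a UTF-8 terminal cannot render as a character; the character-aware result is clean.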
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 04:17:15PM -0400, Colin Walters wrote:
> On Sat, 2003-06-07 at 15:36, Wouter Verhelst wrote:
> > Yeah, but it's not always as good as the legacy support is. For
> > instance, last I tried uxterm (like, 2 minutes ago), I put in a
> > euro sign somewhere. Which appeared correctly (hurray), but doing
> > backspace over that didn't do what it was supposed to do, in that
> > only one of the three unicode bytes was removed (bug not filed yet,
> > will do if I don't forget, and find the time to investigate
> > properly).
>
> Are you using zsh? I get that kind of behavior with it, but bash
> works ok.

No, I'm using bash...

--
Wouter Verhelst
Debian GNU/Linux -- http://www.debian.org
Nederlandstalige Linux-documentatie -- http://nl.linux.org
An expert can usually spot the difference between a fake charge and a
full one, but there are plenty of dead experts.
  -- National Geographic Channel, in a documentary about large African
     beasts.
Re: Status of UTF-8 Debian changelogs
[ no need to CC me ]

On Sat, 2003-06-07 at 17:39, Wouter Verhelst wrote:
> No, I'm using bash...

Weird. It works here. What's your $LANG? If you're inputting Unicode it
should probably be something.UTF-8.
Re: Status of UTF-8 Debian changelogs
On Sat, Jun 07, 2003 at 05:58:28PM -0400, Colin Walters wrote:
> [ no need to CC me ]
>
> On Sat, 2003-06-07 at 17:39, Wouter Verhelst wrote:
> > No, I'm using bash...
>
> Weird. It works here. What's your $LANG? If you're inputting Unicode
> it should probably be something.UTF-8.

it is:

  [EMAIL PROTECTED]:~$ echo $LANG
  nl_BE.UTF-8

which means that uxterm manually ensures that $LANG is set to
something.UTF-8, since I set my $LANG to nl_BE.

Anyway, this is way offtopic here. My point was yes, there is some
unicode-support in Debian, but no, it's not working flawlessly yet. If
you have any other ideas (finding the exact issue still is a worthwhile
goal), please send them by private mail.

--
Wouter Verhelst
Debian GNU/Linux -- http://www.debian.org
Nederlandstalige Linux-documentatie -- http://nl.linux.org
An expert can usually spot the difference between a fake charge and a
full one, but there are plenty of dead experts.
  -- National Geographic Channel, in a documentary about large African
     beasts.
Re: Status of UTF-8 Debian changelogs
On Fri, Jun 06, 2003 at 01:17:00PM +0200, Jérôme Marant wrote:
> > I don't see all those (7|8)-bit-charset-using people requiring the
> > same...
>
> Policy would mean all of them in the same charset, UTF-8 that is.

The issue calls for two comments:

1) Changelogs are required to be written in English, so non-7-bit
   characters should be rare, and use of non-Latin-1 characters is
   probably not a good idea. For example, writing the name of a
   developer with Japanese characters might make it hard for people
   reading the changelog to understand who is referred to. This is
   unfortunate.

2) People write changelogs with whatever locale they use for
   development. Requiring them to use a special tool for writing
   changelogs would be a pain.

I don't know how far lintian can check for UTF-8 encoding.

Cheers,
--
Bill. [EMAIL PROTECTED]

Imagine a large red swirl here.
Re: Status of UTF-8 Debian changelogs
On Fri, Jun 06, 2003 at 06:37:11PM +0200, Bill Allombert wrote:
> On Fri, Jun 06, 2003 at 01:17:00PM +0200, Jérôme Marant wrote:
> > > I don't see all those (7|8)-bit-charset-using people requiring
> > > the same...
> >
> > Policy would mean all of them in the same charset, UTF-8 that is.
>
> The issue calls for two comments:
>
> 1) Changelogs are required to be written in English, so non-7-bit
>    characters should be rare, and use of non-Latin-1 characters is
>    probably not a good idea. For example, writing the name of a
>    developer with Japanese characters might make it hard for people
>    reading the changelog to understand who is referred to. This is
>    unfortunate.
>
> 2) People write changelogs with whatever locale they use for
>    development. Requiring them to use a special tool for writing
>    changelogs would be a pain.
>
> I don't know how far lintian can check for UTF-8 encoding.

Of course, these comments give contradictory rationales. The one says
that mandating UTF-8 is bad because people shouldn't use non-ASCII
characters in changelogs; the other says that mandating UTF-8 is bad
because it makes it harder for people to use non-ASCII characters in
changelogs. I argue that the latter is a *good* thing; and where
exceptions are permitted, they should be encoded using a common
character set.

Checking for non-UTF8 characters in a changelog is trivial. Dump the
file through 'iconv -f utf-8 -t ucs-4', discard the output, and check
the return value. If there are any characters in the stream which are
invalid UTF-8 sequences, iconv will exit with an error code; and this
will be the case for the vast majority of other character sets.

--
Steve Langasek
postmodern programmer
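[Editorial sketch] The validity check described above (decode as UTF-8, discard the result, look only at success or failure) can be sketched without iconv. The Python below is a minimal illustration, assuming the changelog has been read as raw bytes; the function name `is_valid_utf8` is hypothetical and is not part of lintian or any other Debian tool.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 in strict mode."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is always valid UTF-8.
ascii_entry = b"  * New upstream release.\n"
# The same name in a legacy encoding and in UTF-8.
latin1_entry = "Jérôme".encode("iso-8859-1")
utf8_entry = "Jérôme".encode("utf-8")
```

As with the iconv approach, this detects most legacy-encoded text only because such byte sequences rarely happen to form valid UTF-8; it cannot prove that a file which passes was actually written as UTF-8.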
Re: Status of UTF-8 Debian changelogs
On Fri, 2003-06-06 at 12:37, Bill Allombert wrote:
> 1) Changelogs are required to be written in English, so non-7-bit
>    characters should be rare, and use of non-Latin-1 characters is
>    probably not a good idea. For example, writing the name of a
>    developer with Japanese characters might make it hard for people
>    reading the changelog to understand who is referred to. This is
>    unfortunate.
>
> 2) People write changelogs with whatever locale they use for
>    development. Requiring them to use a special tool for writing
>    changelogs would be a pain.

For some of us, the locale encoding is UTF-8. Besides, if you want to
continue using a legacy editor, it should be trivial to convert from
whatever locale encoding you're using into UTF-8 when building the
binary package using iconv. Basically just something like this:

  iconv -f ISO-8859-1 -t UTF-8 debian/changelog > debian/foo/usr/share/doc/foo/changelog.Debian
  gzip -9qf debian/foo/usr/share/doc/foo/changelog.Debian

> I don't know how far lintian can check for UTF-8 encoding.

Actually I sent in a patch for this 153 days ago. Bug #175318.
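[Editorial sketch] The build-time conversion suggested above amounts to decoding the legacy bytes and re-encoding them. A minimal Python sketch follows, assuming the source really is ISO-8859-1; the function name `recode_latin1_to_utf8` is hypothetical.

```python
def recode_latin1_to_utf8(data: bytes) -> bytes:
    """Interpret bytes as ISO-8859-1 text and re-encode that text as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


legacy = "Jérôme Marant".encode("iso-8859-1")
converted = recode_latin1_to_utf8(legacy)
```

One caveat with this design: every byte value is a valid ISO-8859-1 character, so the conversion never fails. Running it on a changelog that is already UTF-8 would silently double-encode the non-ASCII characters, which is why the source encoding has to be known rather than guessed.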
Status of UTF-8 Debian changelogs
Hi,

I've seen some UTF-8-encoded debian/changelog files but I haven't seen
anything mentioning it is allowed in Debian Policy.

According to #174982, the proposal has been accepted but the bug is
still open. When is this planned for?

Thanks.

--
Jérôme Marant
Re: Status of UTF-8 Debian changelogs
On Thu, Jun 05, 2003 at 01:35:38PM +0200, Jérôme Marant wrote:
> I've seen some UTF-8-encoded debian/changelog files but I haven't
> seen anything mentioning it is allowed in Debian Policy.
>
> According to #174982, the proposal has been accepted but the bug is
> still open. When is this planned for?

Ahm. You need it written in the Policy manual to use a 16-bit charset?
I don't see all those (7|8)-bit-charset-using people requiring the
same...

--
     2. That which causes joy or happiness.
Re: Status of UTF-8 Debian changelogs
On Thu, Jun 05, 2003 at 02:23:36PM +0200, Josip Rodin wrote:
> On Thu, Jun 05, 2003 at 01:35:38PM +0200, Jérôme Marant wrote:
> > I've seen some UTF-8-encoded debian/changelog files but I haven't
> > seen anything mentioning it is allowed in Debian Policy.
> >
> > According to #174982, the proposal has been accepted but the bug is
> > still open. When is this planned for?
>
> Ahm. You need it written in the Policy manual to use a 16-bit charset?
                                                         ^^^^^^^^^^^^^^
multibyte encoding. If they were using a 16-bit character set, we'd
have to kill them for creating files that can't be processed as C
strings. :)

--
Steve Langasek
postmodern programmer
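[Editorial sketch] The distinction drawn above between a multibyte encoding and a 16-bit one can be made concrete. The Python below compares UTF-8 with UTF-16 (little-endian, chosen here only as a representative 16-bit encoding) for plain ASCII text; the variable names are illustrative.

```python
text = "Debian"

# UTF-8 leaves ASCII text byte-for-byte unchanged and free of NUL bytes.
utf8_bytes = text.encode("utf-8")

# UTF-16 stores each ASCII character as two bytes, one of them zero.
# A C string routine would treat that zero byte as the end of the string.
utf16_bytes = text.encode("utf-16-le")

utf8_has_nul = b"\x00" in utf8_bytes
utf16_has_nul = b"\x00" in utf16_bytes
```

This is why ASCII-only tools keep working on UTF-8 changelogs: the technical content stays identical, and only the non-ASCII characters occupy more than one byte.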
Re: Status of UTF-8 Debian changelogs
On Thu, 2003-06-05 at 08:23, Josip Rodin wrote:
> Ahm. You need it written in the Policy manual to use a 16-bit charset?

As Steve points out, the size of the code space isn't particularly
relevant.

> I don't see all those (7|8)-bit-charset-using people requiring the
> same...

The problem is that we have no way to know what encoding an individual
Debian Changelog entry is in. This is actually important for stuff like
apt-listchanges. I constantly see broken characters in Debian
changelogs in apt-listchanges from people using ISO-8859-1, when my
terminal speaks UTF-8 natively.

If you're using an ISO-8859-1 terminal, then apt-listchanges could
recode the changelogs from UTF-8 to ISO-8859-1 (or try, anyways). And
since my terminal speaks UTF-8, apt-listchanges could just pass it on
as-is.

A situation where it can just be any encoding (or even a mix, if say a
speaker of an ISO-8859-2 language later takes over from the previous
ISO-8859-1 maintainer) is just terribly broken. UTF-8 is the one and
only sane choice.

This policy amendment got a number of seconds, so unless you can raise
a coherent objection, I think it should go in.
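[Editorial sketch] The recoding step described above (known UTF-8 input, converted for whatever the terminal can show) might look roughly like this in Python. This is not how apt-listchanges is actually implemented; the function name `recode_for_terminal` and its parameters are hypothetical.

```python
def recode_for_terminal(utf8_data: bytes, terminal_encoding: str) -> str:
    """Decode UTF-8 input and degrade it to what the terminal can display."""
    text = utf8_data.decode("utf-8")
    # Round-trip through the terminal encoding; characters that the
    # encoding lacks are replaced with "?".
    encoded = text.encode(terminal_encoding, errors="replace")
    return encoded.decode(terminal_encoding)


entry = "Thanks to Jürgen and Дмитрий".encode("utf-8")

on_utf8 = recode_for_terminal(entry, "utf-8")         # passed through intact
on_latin1 = recode_for_terminal(entry, "iso-8859-1")  # Cyrillic degrades
on_koi8r = recode_for_terminal(entry, "koi8-r")       # the umlaut degrades
```

The point of the sketch is that this only works because the input encoding is known. With an unspecified or mixed encoding there is no correct first `decode` step, which is the situation the message above calls broken.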
Re: Status of UTF-8 Debian changelogs
On Thu, Jun 05, 2003 at 02:58:12PM -0400, Colin Walters wrote:
> The problem is that we have no way to know what encoding an
> individual Debian Changelog entry is in.

The problem is that my point entirely flew over your head. The point
was, as usual, that Policy is not designed to be a stick to beat people
with, and that it does not have to precede implementation.

You can already complain at people who use e.g. Latin 1 in changelogs.
Once a released version of the Policy manual gets a shiny and bright
new sentence saying Use Unicode (just in a roundabout, somewhat
patronizing kind of way), the only thing that will change is that if
someone complains at people who use UTF-8 in changelogs, a new retort
will be available, THE POLICY MADE ME DO IT!!1!, or similar.

Oh, and insert another standard rant here on how the fact something
hasn't been done does not automatically imply that those who haven't
done it are obstructionist sadistic bastards.

--
     2. That which causes joy or happiness.
Re: Status of UTF-8 Debian changelogs
On Thu, Jun 05, 2003 at 10:40:07PM +0200, Josip Rodin wrote:
> On Thu, Jun 05, 2003 at 02:58:12PM -0400, Colin Walters wrote:
> > The problem is that we have no way to know what encoding an
> > individual Debian Changelog entry is in.
>
> The problem is that my point entirely flew over your head. The point
> was, as usual, that Policy is not designed to be a stick to beat
> people with, and that it does not have to precede implementation.
>
> You can already complain at people who use e.g. Latin 1 in
> changelogs. Once a released version of the Policy manual gets a shiny
> and bright new sentence saying Use Unicode (just in a roundabout,
> somewhat patronizing kind of way), the only thing that will change is
> that if someone complains at people who use UTF-8 in changelogs, a
> new retort will be available, THE POLICY MADE ME DO IT!!1!, or
> similar.

Common sense already dictates that untagged, non-ASCII characters
should not be used in documents that must be parsed in a multilingual
environment (e.g., the planet Earth). Specifying UTF8 as an encoding
for changelogs is to *permit* something which is desirable but not
sensibly achievable in the absence of a policy for it.

I'm more than happy to beat people for using non-UTF8 characters in
changelog with the stick I'm currently holding -- no need to roll up
Policy for this purpose. ;)

--
Steve Langasek
postmodern programmer
Re: Status of UTF-8 Debian changelogs
On Thu, 2003-06-05 at 16:40, Josip Rodin wrote:
> On Thu, Jun 05, 2003 at 02:58:12PM -0400, Colin Walters wrote:
> > The problem is that we have no way to know what encoding an
> > individual Debian Changelog entry is in.
>
> The problem is that my point entirely flew over your head. The point
> was, as usual, that Policy is not designed to be a stick to beat
> people with, and that it does not have to precede implementation.

You certainly had a strange way of stating this; your initial reply
seemed to focus on the size of the code space of the character sets.

Anyways, you could consider this as already mostly implemented; the
vast majority of changelogs are pure ASCII; there's only a few people
using ISO-8859-x and UTF-8. Given the disadvantages of the former, we
should standardize on the latter, and that's what this policy amendment
is all about.

> You can already complain at people who use e.g. Latin 1 in
> changelogs. Once a released version of the Policy manual gets a shiny
> and bright new sentence saying Use Unicode (just in a roundabout,
> somewhat patronizing kind of way),

I see no reason for it to be either roundabout or patronizing; perhaps
you could suggest an alternative wording that would remove these
perceived qualities?

> the only thing that will change is that if someone complains at
> people who use UTF-8 in changelogs, a new retort will be available,
> THE POLICY MADE ME DO IT!!1!, or similar.

Why would someone complain?

> Oh, and insert another standard rant here on how the fact something
> hasn't been done does not automatically imply that those who haven't
> done it are obstructionist sadistic bastards.

I never implied such, or if I did it was certainly not my intention. I
think you've been doing a great job as a policy editor, and I assume
that not adding this amendment was just an oversight.