Re: Switching to UTF-8
Markus Kuhn [EMAIL PROTECTED] writes:

> c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
>    for my comfort. In particular, I don't like that the UTF-8 mode is
>    not binary transparent. Work on turning Emacs completely into a
>    UTF-8 editor is under way, and I'd be very curious to hear about
>    the current status and whether there is anything to test already.
>    Anyone?

AFAIK, there is some activity on the Emacs 22 branch. XEmacs is in the
process of switching to UCS for its internal character set, too.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
Markus Kuhn [EMAIL PROTECTED] writes:

> As we are talking about en_US.UTF-8: General warning: Please do not
> use the locale name en_US.UTF-8 anywhere outside North America.

Why can't you use it for LC_CTYPE and LC_MESSAGES, say? Determining
paper size by locale is rather strange. What's next? Keyboard layout?
Mouse orientation? Monitor size?
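The point stands because POSIX locale categories can be set
independently of each other. A minimal C sketch (the locale name
"en_US.UTF-8" is only an example and must be installed on the system;
the "C" fallback is guaranteed to exist) that switches only the
character classification category, leaving LC_PAPER, LC_COLLATE etc.
untouched:

```c
#include <locale.h>

/* Switch only LC_CTYPE to a UTF-8 locale; other categories keep
 * their previous settings.  Returns the selected locale name, or
 * falls back to the always-available "C" locale. */
const char *set_ctype_utf8(void)
{
    const char *r = setlocale(LC_CTYPE, "en_US.UTF-8");
    if (r == NULL)
        r = setlocale(LC_CTYPE, "C");   /* guaranteed to exist */
    return r;
}
```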
Re: POSIX:2001 now available online
Markus Kuhn [EMAIL PROTECTED] writes:

> The revised POSIX standard, which has been merged with the Single
> UNIX Specification, is now available online in HTML!

It is complicated to look up sections by their number. Or am I missing
something?
Re: [I18n]Re: Li18nux Locale Name Guideline Public Review
Bram Moolenaar [EMAIL PROTECTED] writes:

> Ignoring case does not appear to lead to compatibility problems.

It does. Case is used to separate the public and private namespaces
(probably a design mistake). However, we should ignore case in the
charset part: we are going to use mainly MIME charset names (at least
I hope so), and MIME charset names are case-insensitive.

Anyway, the case sensitivity issue is a strawman, IMHO. If there is a
single, system-wide locale database, the name of a locale becomes much
less of an issue (as it should never be transmitted over the wire).
Current experience with GNU libc and XFree86 4.1.x shows that enriching
locale data based on the name simply does not work.
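Since MIME charset names are defined to be case-insensitive, a
comparison routine should also not depend on the current locale's
notion of case (strcasecmp does). A minimal sketch of a locale-
independent, ASCII-only comparison:

```c
/* ASCII-only lowercasing; deliberately ignores the locale so that
 * charset name comparison behaves the same everywhere. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Nonzero when the two charset names match case-insensitively,
 * e.g. "UTF-8" and "utf-8". */
int charset_name_eq(const char *a, const char *b)
{
    while (*a && *b) {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* both strings must end together */
}
```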
Re: Free availability of ISO/IEC standards
Keld Jørn Simonsen [EMAIL PROTECTED] writes:

> Can't you get access to them in the onsite department of the library?
> (That is, the department where you cannot loan the books, but only
> read them onsite.)

No, definitely not. The librarians don't even know how to get those
standards (ISO and IEEE). There are *copies* of DIN standards in the
university library, but the archive is far from complete, and you never
know (without independent checking) if you've missed an important TC.

Thanks to modern technology, I can query quite a few library catalogs
simultaneously. For example, the libraries in Southwest Germany have
got books with "coded character sets" in the title, but all of them are
ECMA standards.

> I gather that I am in a lucky position living in a big city in one of
> the more developed countries of the world, but generally universities
> in all countries, at least in the industrialized world, have systems
> so a student can get hold of any major technical book (this is
> essential for a university to fulfill its mission), and often the
> general public can get access too if they are persistent enough.

Of course, you can ask the university to buy the book. I've been told
this wouldn't be a problem, although you can't use the $18 PDF option
in this case.
Re: unicode in emacs 21
Eli Zaretskii [EMAIL PROTECTED] writes:

>> The GNU Emacs/Unicode proposal I've seen seems to have this
>> property, too. (At least the proposal is ambiguous, and one
>> interpretation is that you can encode a single character in
>> multiple ways.)
>
> Unless you refer to the CNS plane and Japanese Han characters, which
> were deliberately left ununified (in addition to the Unicode code
> points for those characters), I think you are mistaken.

I hope so. ;-)

> Could you please point out where in the proposal you see that a
> character can be encoded in multiple ways?

I think now that the surrogate stuff has been explained, the encoding
to UCS-E (Unicode-compatible Character Set for Emacs) is indeed
unambiguous. However, UTF-E (the buffer encoding) opens possibilities
for different encodings of the same UCS-E code point, but this can be
resolved, I think.
Re: Unicode in Emacs again
Kenichi Handa [EMAIL PROTECTED] writes:

> Florian Weimer [EMAIL PROTECTED] writes:
>
>> What does 'via surrogate pair' mean? I guess the second line should
>> read:
>>
>>   00  Unicode 20 bit  (U+10000 - U+FFFFF)
>
> Yes. That's correct, and the third line should read as below:
>
>   01  Unicode 20 bit  (U+100000 - U+10FFFF)

I'm still not convinced it's correct. My current understanding is that
it should be:

  00  Unicode 20 bit       (U+000000 - U+0FFFFF)
  01  Unicode 20.08... bit (U+100000 - U+10FFFF)

I'm currently reading the emacs-unicode mailing list, and it seems a
few essential issues weren't on the horizon back then. Shall I send a
comment to the emacs-unicode mailing list when I'm finished?
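For what it's worth, the bit counts in the table above can be checked
mechanically: a leading 00 in a 22 bit code leaves exactly 20 payload
bits, the 01 prefix only needs 2^16 Unicode code points, and numbering
the whole Unicode range U+000000 - U+10FFFF takes log2(0x110000),
which is the "20.08... bit" figure. A small C sketch of the
arithmetic:

```c
#include <math.h>

/* Code points covered by the 00 prefix (20 payload bits). */
unsigned long prefix00_size(void)
{
    return 0x0FFFFFul + 1;
}

/* Unicode code points left over for the 01 prefix. */
unsigned long prefix01_unicode_size(void)
{
    return 0x10FFFFul - 0x100000ul + 1;
}

/* Bits needed to number the whole Unicode range U+0000..U+10FFFF. */
double unicode_bits(void)
{
    return log((double)0x110000) / log(2.0);   /* about 20.087 */
}
```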
A more verbose version of Emacs-Unicode-990824
Unicode Support for GNU Emacs
*****************************

This memo documents the current plans for bringing Unicode to GNU
Emacs. It describes the requirements and constraints for the new Emacs
character set, and the transformation format used to encode this
character set in buffers. It reflects the discussion on the
`emacs-unicode' mailing list and the `Emacs-Unicode-990824' proposal.

Version $Revision: 1.1 $, written by Florian Weimer.

Requirements
============

The internal character code of a character has to fit in 22 bits. (The
remaining bits of a 32 bit host integer are required for tagging.)

The representation of characters in buffers and strings has to be
compact. 22 and more bits per ASCII character are not acceptable.

Latin scripts are unified. There are strong reservations regarding Han
unification. Emacs must be able to display Han characters using a font
which matches the expectations of CJKV users. In addition, there are
some character sets for which no corresponding code points have been
assigned yet in Unicode.

The Emacs character set should deviate as little as possible from the
Unicode character set (and similarly, from other included character
sets). Each deviation has to be documented, and since documentation is
now widely available [Unicode], it does not make sense to rewrite this
documentation from scratch.

(Up to this point, these requirements were mentioned in previous
discussions on the `emacs-unicode' mailing list.)

We should assume that UTF-8 [RFC 2279] becomes the dominant character
set on GNU systems. Users will want to enable it by default.
Therefore, we have to guarantee the following things:

* Emacs must be able to read any file in UTF-8, even if it contains
  invalid UTF-8 sequences.

* If a file is read into Emacs and written again without editing, the
  written file must match the original, including possibly broken
  UTF-8 sequences.

* If the user instructs Emacs to read a file, edits a certain part,
  and writes it back, portions which have not been edited should not
  change in any way (even in the presence of broken UTF-8 sequences).

On some proprietary platforms, there is a strong trend towards UTF-16,
and similar requirements apply there (with broken surrogate pairs
instead of broken UTF-8 sequences).

Rejected Requirements
=====================

Latin unification means that it is not possible to read an ISO 2022
encoded file (which might contain several scripts from ISO 8859 that
are unified in Unicode) and write it back again so that it matches the
original. In addition, the shape of accents varies from one Latin
script to another, and those accents are unified in Unicode. This
might introduce slight typographic inaccuracies if the wrong font is
chosen, which seem, however, to be acceptable in a text editor.

Tools Available for Implementation
==================================

We can achieve Latin unification either by carefully unifying the
existing MULE charsets, or by switching to Unicode. Because of the
other requirements, in particular documentation, the latter seems to
be desirable.

There are several approaches for working around Han unification:

* plane 14 language tags [Plane14] (now an official part of Unicode)

* text properties

* separate CJKV character sets (in particular for KJV users; C seems
  to be not so problematic)

A language tag in each character is not possible because of the 22 bit
limit for a character code. Because of the need for a Han unification
workaround, straightforward UCS-4 cannot be used for the Emacs
character set.

The Current Proposal
====================

The GNU Emacs Unicode proposal consists of two parts: a character set,
and an encoding of this character set for use in buffers and strings.
Basic semantics have not been discussed much yet.

The Emacs Character Set
-----------------------

The Unicode-compatible Character Set for Emacs (UCS-E) is based on
UCS-4. In the following, we use the U+ABCDEF notation (where ABCDEF
are hexadecimal digits) to refer to UCS-4 characters, and the E+ABCDEF
notation to refer to characters in UCS-E.

The character range E+000000 up to E+10FFFF is identical to UCS-4
(U+000000 up to U+10FFFF, 17 planes of 65,536 code points each). This
is exactly the range which is addressable using surrogate pairs and
UTF-16. However, the planes beyond this range are used differently:
planes 17 to 23 are reserved for Emacs (E+110000 - E+17FFFF), planes
24 to 31 are intended for private use (E+180000 - E+1FFFFF), and
planes 32 to 63 are partly used for encoding CJK characters, partly
for private use characters (E+200000 - E+3FFFFF).

This results in the following picture, with bit masks in the first
column:

  00       Unicode   U+000000 - U+0FFFFF
  01       Unicode   U+100000 - U+10FFFF
  01 0ppp  7 64K planes reserved
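The plane layout described above can be sketched as a small classifier
in C; the enum and function names are mine, not part of the proposal:

```c
/* Sketch: classify a UCS-E code point according to the plane layout
 * (planes 0-16 Unicode, 17-23 reserved for Emacs, 24-31 private use,
 * 32-63 CJK or private use; nothing above the 22 bit limit). */
typedef enum {
    UCSE_UNICODE,
    UCSE_RESERVED,
    UCSE_PRIVATE,
    UCSE_CJK_OR_PRIVATE,
    UCSE_INVALID
} ucse_class;

ucse_class ucse_classify(unsigned long c)
{
    if (c <= 0x10FFFFul) return UCSE_UNICODE;         /* planes 0-16  */
    if (c <= 0x17FFFFul) return UCSE_RESERVED;        /* planes 17-23 */
    if (c <= 0x1FFFFFul) return UCSE_PRIVATE;         /* planes 24-31 */
    if (c <= 0x3FFFFFul) return UCSE_CJK_OR_PRIVATE;  /* planes 32-63 */
    return UCSE_INVALID;          /* beyond the 22 bit character code */
}
```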
Re: unicode in emacs 21
Richard Stallman [EMAIL PROTECTED] writes:

> Supporting Unicode superficially while retaining the current internal
> representation raises a number of problems, one of them being that
> the internal representation has several alternatives for the same
> character which correspond to the same code point in Unicode.

The GNU Emacs/Unicode proposal I've seen seems to have this property,
too. (At least the proposal is ambiguous, and one interpretation is
that you can encode a single character in multiple ways.)
Re: unicode in emacs 21
H. Peter Anvin [EMAIL PROTECTED] writes:

> Does that mean you're painting yourself into a corner, though,
> requiring manual work to integrate the increasingly Unicode-based
> infrastructure support that is becoming available? Odds are pretty
> good that they are.

I don't think it is a good idea to use operating system Unicode
support. This would mean that GNU Emacs behaves differently on
different operating systems, depending on the installed locale
descriptions, for example.

OTOH, the character encodings posted earlier to this list are as
incompatible with existing Unicode support as the current emacs-mule
internal encoding. In effect, just one Emacs-specific internal
encoding is replaced by another.
Re: unicode in emacs 21
Eli Zaretskii [EMAIL PROTECTED] writes:

>> Why can't you continue to use the MULE code and just change the
>> character sets to reflect certain aspects of Unicode?
>
> The current plan for Unicode was discussed at length 3 years ago, and
> the result was what I described.

Is the discussion archived somewhere, or are there some design
documents which resulted from the discussion?

> I don't think it's wise for us to reopen that discussion again,
> unless you think the UTF-8-based representation is a terribly wrong
> design.

Of course, it's hard to come up with constructive criticism when you
don't know what's already there. ;-)

> So I don't see any reason for the unnamed Unicode people to get
> annoyed by a term they themselves coined.

Me neither, but I got flamed in the past. :-/

> Conceivably, changing the internal representation doesn't mean we
> need to rewrite all of the existing code, just the low-level parts of
> it that deal with code conversions (i.e. subroutines of encoding and
> decoding functions).

I still don't understand the need for such a change. In theory, the
internal representation of characters should be invisible to the
higher levels.
Re: unicode in emacs 21
Eli Zaretskii [EMAIL PROTECTED] writes:

> Emacs cannot use a pure UTF-8 encoding, since some cultures don't
> want unification, and it was decided that Emacs should not force
> unification on those cultures.

Why can't you continue to use the MULE code and just change the
character sets to reflect certain aspects of Unicode? One such aspect
is Latin unification, for example. (The Unicode people get very
annoyed if you talk about unification, the source separation rule etc.
in the context of non-Han scripts...)

In a second step, support for normalization, combining characters etc.
would have to be added, but this could be based on the reliable
foundation of the old MULE code.
Re: UTF16 and GCC
[EMAIL PROTECTED] (Kai Henningsen) writes:

>>> * Do we need a native wide char encoding, too (mostly for Win32
>>>   where it's UTF-16, but possibly also some Asian thing)?
>>
>> A single 'char' encoded in UTF-16? This sounds horrible.
>
> I can't quite parse that.

If you've got a 16 bit wchar_t, there's no way that it can store
characters encoded in UTF-16. What happens to characters outside the
BMP? A 16 bit wchar_t in C only makes sense in conjunction with UCS-2.
All C functions working on wide characters can only deal with
characters in the BMP anyway, even if you permit encoding wchar_t *
strings in UTF-16.
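For illustration, this is how a supplementary character is split into
a UTF-16 surrogate pair; the two 16 bit units make it obvious that a
single 16 bit wchar_t cannot hold such a character. (The function name
is mine, not from any standard library.)

```c
/* Sketch: encode a code point in U+10000..U+10FFFF as a UTF-16
 * surrogate pair, per the standard arithmetic: subtract 0x10000,
 * then split the remaining 20 bits across a high surrogate
 * (0xD800 + top 10 bits) and a low surrogate (0xDC00 + low 10 bits). */
void utf16_encode_supplementary(unsigned long c,
                                unsigned short *hi, unsigned short *lo)
{
    c -= 0x10000;
    *hi = (unsigned short)(0xD800 + (c >> 10));
    *lo = (unsigned short)(0xDC00 + (c & 0x3FF));
}
```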
Re: UTF16 and GCC
[EMAIL PROTECTED] (Kai Henningsen) writes:

> * Do we need a native wide char encoding, too (mostly for Win32 where
>   it's UTF-16, but possibly also some Asian thing)?

A single 'char' encoded in UTF-16? This sounds horrible.
Re: Word and Antiword
Markus Kuhn [EMAIL PROTECTED] writes:

> Antiword is available from http://www.winfield.demon.nl/ and provides
> significantly better DOC -> plaintext conversion than any Microsoft
> product.

Unfortunately, this is not true. It fails badly on Word documents with
an embedded change history, like any other third-party converter I've
tested so far. This can be quite dangerous because the extracted
plaintext can differ substantially from what a Word user sees on the
screen.
Re: file name encoding
Bruno Haible [EMAIL PROTECTED] writes:

> The programs we are waiting for are:
>
> - emacs. In a UTF-8 locale, it does not set the
>   keyboard-coding-system to UTF-8, thus when I type umlaut keys
>   strange things happen. And it does not set the default file
>   encoding to UTF-8,

I hope so! Setting the default encoding to UTF-8 for random files is
harmful in the Emacs context, especially with the current fragile
UTF-8 implementation.

> thus I see mojibake every time I open a file which looks perfectly
> nice through cat or vi in xterm. But we heard the Emacs developers
> are working on this lately.

Yes, the specific problems are solved. It isn't a big deal actually,
but apparently no one had actually tried to run Emacs on a multibyte
terminal. A few months ago, some guy from Germany (not me, BTW)
triggered a general bug in the Emacs keyboard coding system in this
context, which has reportedly been fixed in the development sources.

Anyway, you can run a suitably recent version of Emacs (probably not
the Emacs 21 branch, however) inside a UTF-8 xterm and it works mainly
as expected. Actually, I've got access to Emacs 20 with MULE-UCS only,
and the results are promising indeed. I didn't check whether the
notions of full width characters match and other sophisticated stuff,
but the HELLO file displays quite nicely.
Re: Set Character Width Proposal (Version 3)
Markus Kuhn [EMAIL PROTECTED] writes:

> Here is another iteration of the SCW control function definition, to
> allow users of terminal emulators full control over whether
> single-width or double-width glyphs will be used:

Why don't you use the Unicode tagging mechanism (or some special
Unicode characters)? I think this makes sense even in plain text, and
not only when communicating with terminal devices.
Re: wchar_t -- Unicode Conversion
Michael B. Allen [EMAIL PROTECTED] writes:

> Why doesn't wchar_t play nice with Unicode?

It does, if your C implementation defines the macro name
__STDC_ISO_10646__ (see the C standard for additional information).
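A minimal sketch for probing this at compile time; when the macro is
defined, wchar_t values are ISO 10646 code positions, so a platform
whose wchar_t can at best cover the BMP should not define it for
recent editions of ISO 10646:

```c
/* Sketch: nonzero when this C implementation promises that wchar_t
 * values are ISO 10646 (Unicode) code positions.  The macro expands
 * to a yyyymmL date naming the supported edition of the standard. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```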
UTF-8 in RFC 2279 and ISO 10646
Sorry for this question which is slightly off topic: Are the UTF-8
definitions in ISO/IEC 10646-1:2000 and RFC 2279 identical or
equivalent? Can any harm result if a normative document refers to both
definitions? (This would be a bad idea if the definitions are slightly
different.)

And BTW: Does ISO 10646 define character properties (such as lowercase
letter, uppercase letter, titlecase letter, other letter, decimal
digit, other digit and so on)?
Re: REVERSE SOLIDUS in JIS0208.TXT
Markus Kuhn [EMAIL PROTECTED] writes:

> Note that we have the exact same problem with various
> European/American encodings such as CP437, where IBM and Microsoft
> came up with radically different and incompatible mappings

If I'm not mistaken, at least one character in CP437 has even been
reassigned. Older graphics hardware and printers interpret 0xe1 as
U+03B2 GREEK SMALL LETTER BETA, and not as U+00DF LATIN SMALL LETTER
SHARP S, which can be quite annoying if you need the latter, because
the glyphs are clearly different.
Re: Unicode is optimal for Chinese/Japanese multilingual texts
"H. Peter Anvin" [EMAIL PROTECTED] writes:

>> The Chinese Academy of Sciences has published a set of scalable
>> fonts in several styles, but unfortunately in a proprietary format
>> with closed-source converters to PK format for usage with TeX.
>
> Is there any description of this format?

I didn't find one when I looked for it a few years ago. Perhaps the
format description is available in Chinese, but I can't read that.

> What kinds of curves does it use?

I'm not sure if it uses curves at all. :-/
Re: Unicode is optimal for Chinese/Japanese multilingual texts
Tomohiro KUBOTA [EMAIL PROTECTED] writes:

> I don't know about Chinese and Korean font projects.

The Chinese Academy of Sciences has published a set of scalable fonts
in several styles, but unfortunately in a proprietary format with
closed-source converters to PK format for usage with TeX.
Re: Doublewidth Cyrillic for unhappy Japanese people
Markus Kuhn [EMAIL PROTECTED] writes:

> The only characters for which double-width (square) is appropriate
> are
>
>   - Han ideographs
>   - Hiragana/Katakana
>   - Hangul
>   - CJK punctuation
>   - fullwidth forms

There are a few other characters which simply can't be displayed
properly using single-width glyphs, for example:

  U+222D TRIPLE INTEGRAL
  U+24A8 PARENTHESIZED LATIN SMALL LETTER M
  U+FB03 LATIN SMALL LIGATURE FFI
  U+FB04 LATIN SMALL LIGATURE FFL
  U+2473 CIRCLED NUMBER TWENTY
  U+2487 PARENTHESIZED NUMBER TWENTY
  U+24DC CIRCLED LATIN SMALL LETTER M
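The usual implementation strategy for a wcwidth()-style function (as
in Markus Kuhn's well-known reference implementation) is a sorted
table of code point intervals. The toy table below covers only a few
of the double-width categories listed above and is deliberately
incomplete; it is a sketch of the technique, not usable width data:

```c
/* Sketch: interval-table width lookup.  The table is illustrative
 * and far from complete. */
struct interval { unsigned long first, last; int width; };

static const struct interval widths[] = {
    { 0x1100, 0x115F, 2 },   /* Hangul Jamo initial consonants */
    { 0x3000, 0x303E, 2 },   /* CJK punctuation */
    { 0x3041, 0x33FF, 2 },   /* Hiragana, Katakana, ... */
    { 0x4E00, 0x9FFF, 2 },   /* CJK unified ideographs */
    { 0xAC00, 0xD7A3, 2 },   /* Hangul syllables */
    { 0xFF01, 0xFF60, 2 },   /* fullwidth forms */
};

int sketch_wcwidth(unsigned long c)
{
    unsigned i;
    for (i = 0; i < sizeof widths / sizeof widths[0]; i++)
        if (c >= widths[i].first && c <= widths[i].last)
            return widths[i].width;
    return 1;   /* default: single width */
}
```

A production version would binary-search a complete, generated table
instead of scanning linearly.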
Re: Doublewidth block graphics for unhappy MS-DOS users
Markus Kuhn [EMAIL PROTECTED] writes:

> CP437/CP850 is still used today in the MS-DOS box on *every*
> Windows 98 machine in West Europe/US/etc.

These codepages are also used on IBM operating systems such as OS/2
and AIX, I guess.
Re: Doublewidth Cyrillic for unhappy Japanese people
Martin Norbäck [EMAIL PROTECTED] writes:

> I think this is a simple issue of counting the vertical lines in the
> glyph.

I think that's too coarse. There might be some cases in which existing
monospace fonts treat characters as single-width because systems with
9x16 or 8x8 glyph cells are much more commonly used than 6x13 cells.
In such cases, compatibility should be preserved.

> The Latin ligatures should be double width as well, but who uses
> them in plain text?

I guess people who play with Unicode to upset other people. ;-)

> As for the EM DASH, typographically it should perhaps be double
> width, but we aren't dealing with typography. As long as it's
> readable, I would rather see as few double width characters as
> possible.

I think it has to be double-width in order to see that it's not an EN
DASH.
Re: multilingual man pages
Bruno Haible [EMAIL PROTECTED] writes:

> Wouldn't it be better to use standard names in all cases, and use a
> simple Emacs Lisp function to convert the standard name to an Emacs
> name? The Emacs PO mode already has code for this.

I think Gnus implements a different, but similar functionality, based
on the value of `mm-mime-mule-charset-alist' and the `mime-charset'
coding system attribute.
Re: Doublewidth EM DASH for unhappy English people
Markus Kuhn [EMAIL PROTECTED] writes:

> I see actually no big problem to make all the circled and
> parenthesised numbers and letters doublewidth in the standard
> wcwidth, or even the EM DASH. It would just mean that the definition
> of wcwidth becomes an actual design issue, and not just like it is at
> the moment a function rather strictly derived from a Unicode database
> property.

I guess an additional character property is needed for this, although
this is rather a glyph property. Perhaps some special combining
characters (FORCE DOUBLE WIDTH, FORCE SINGLE WIDTH) could be helpful
as well, at least for communication with terminal emulators.

> I also suspect that Japanese users will not really want to insist on
> doublewidth European letters. The only point of conflict that I see
> are the block graphics characters, as they are used in both
> communities widely with their respective widths.

There are also quite a few scripts which feature a combination of
simple and rather complex glyphs (the latter don't fit well into a
single-width box). Cyrillic and the Latin Serbo-Croatian
transliteration are examples, and Arabic, Devanagari, and Tibetan are
actually displayed with both single- and double-width glyphs by Emacs.
In addition, there are combining characters which substantially change
the width of the character they are applied to.
Re: TCL/Tk and ISO10646-1 fonts
Markus Kuhn [EMAIL PROTECTED] writes:

> It seems that the soon to be released new TCL/Tk 8.3.3 is finally
> going to be able to use *-iso10646-1 fonts directly, thanks to recent
> patches by Jeff Hobbs [EMAIL PROTECTED] and Brent Welch
> [EMAIL PROTECTED].

BTW, what about their UTF-8 decoder? Does it still accept overlong
sequences and fall back to ISO-8859-1 if it's unable to decode some
characters?
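(An overlong sequence encodes a code point in more bytes than
necessary, e.g. 0xC0 0xAF for '/'; a safe decoder must reject such
forms, since they let forbidden characters slip past naive filters.)
A decoder can detect overlongs by checking the decoded value against
the minimum for the sequence length; a minimal C sketch:

```c
/* Sketch: nonzero when code point c is legitimately encoded in
 * nbytes bytes, i.e. the sequence is minimal-length, not overlong.
 * RFC 2279 also defined 5 and 6 byte sequences; a decoder restricted
 * to U+0000..U+10FFFF can simply reject those. */
int utf8_length_is_minimal(unsigned long c, int nbytes)
{
    switch (nbytes) {
    case 1: return c <= 0x7F;
    case 2: return c >= 0x80 && c <= 0x7FF;
    case 3: return c >= 0x800 && c <= 0xFFFF;
    case 4: return c >= 0x10000 && c <= 0x10FFFF;
    default: return 0;
    }
}
```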
Re: iconv in glibc
Bruno Haible [EMAIL PROTECTED] writes:

> Edmund GRIMLEY EVANS asked on 1999-11-25:
>
>> Will iconv() in glibc-2.2 convert from utf-7?
>
> Yes. It has been added to glibc-2.2 in order to cope with email
> messages sent out in this encoding by some mailers in East Asia.

I've seen quite a few messages originated in Germany as well. Some of
the Usenet agents by Microsoft can be (mis)configured to use it, it
seems.
Re: Substituting malformed UTF-8 sequences in a decoder
Edmund GRIMLEY EVANS [EMAIL PROTECTED] writes:

>> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
>
> This is what I do in Mutt. It's easy to implement and works for any
> multibyte encoding; the program doesn't have to know about UTF-8.

This is what I recommend at the moment, with two exceptions: For
UTF-8-to-UTF-16 translation, a UCS-4 character which can't be
represented in UTF-16 is replaced with a single replacement character.
This also applies to syntactically correct UTF-8 sequences which are
either overlong or encode code positions, such as surrogates, which
are forbidden in UTF-8.

>> D) Emit a malformed UTF-16 sequence for every byte in a malformed
>>    UTF-8 sequence
>
> Not much good if you're not converting to UTF-16.

Well, it works with UCS-4 as well (but I would use a private area for
this kind of stuff until it's generally accepted practice to do such
hacks with surrogates). I think D) could be yet another translation
method (in addition to "error" and "replace"), but it shouldn't be the
only one a UTF-8 library provides. With method D), your UTF-8
*encoder* might create an invalid UTF-8 stream, which is certainly not
desirable for some applications.

> It's unfortunate that the current UTF-8 stuff for Emacs causes
> malformed UTF-8 files to be silently trashed.

Yes, that's quite annoying. But the whole MULE stuff is a big mess.
In-band signalling everywhere. :-( (Some byte sequences in a
single-byte buffer do very strange things.)
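Option B) above can be sketched as a small decoder that validates lead
bytes, continuation bytes, minimal length, and the surrogate/range
restrictions, and emits one U+FFFD per offending byte; everything here
is illustrative, not taken from Mutt:

```c
#include <stddef.h>

/* Sketch of policy B: decode UTF-8, emitting U+FFFD for every byte
 * that is not part of a valid, minimal-length sequence.  Returns the
 * number of code points stored in out; out must have room for len
 * entries (the worst case of all-bad input). */
size_t utf8_decode_replace(const unsigned char *s, size_t len,
                           unsigned long *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char b = s[i];
        unsigned long c = 0, min = 0;
        int need = 0, ok = 1, k;

        if (b < 0x80) { out[n++] = b; i++; continue; }
        else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; need = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { c = b & 0x07; need = 3; min = 0x10000; }
        else ok = 0;            /* stray continuation byte or 0xF8..0xFF */

        if (ok && i + (size_t)need < len) {
            for (k = 1; k <= need; k++) {
                if ((s[i + k] & 0xC0) != 0x80) { ok = 0; break; }
                c = (c << 6) | (s[i + k] & 0x3F);
            }
        } else {
            ok = 0;             /* sequence truncated at end of input */
        }

        if (ok && (c < min || c > 0x10FFFF ||
                   (c >= 0xD800 && c <= 0xDFFF)))
            ok = 0;             /* overlong, out of range, or surrogate */

        if (ok) { out[n++] = c; i += (size_t)need + 1; }
        else    { out[n++] = 0xFFFD; i++; }  /* one U+FFFD per bad byte */
    }
    return n;
}
```

Note the policy detail: on a bad continuation byte, only the lead byte
is replaced and scanning resumes at the next byte, so every byte of a
malformed sequence ends up producing its own U+FFFD.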