Re: Switching to UTF-8

2002-05-01 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

   c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
  for my comfort. In particular, I don't like that the UTF-8 mode is not
binary transparent. Work on turning Emacs completely into a UTF-8
  editor is under way, and I'd be very curious to hear about the
  current status and whether there is anything to test already.
  Anyone?

AFAIK, there is some activity on the Emacs 22 branch.  XEmacs is in
the process of switching to UCS for its internal character set, too.
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Please do not use en_US.UTF-8 outside the US

2002-05-01 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 As we are talking about en_US.UTF-8:

 General warning: Please do not use the locale name en_US.UTF-8 anywhere
 outside North America.

Why can't you use it for LC_CTYPE and LC_MESSAGES, say?

Determining paper size by locale is rather strange.  What's next?
Keyboard layout?  Mouse orientation?  Monitor size?




Re: POSIX:2001 now available online

2002-02-13 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 The revised POSIX standard, which has been merged with the Single UNIX
 Specification, is now available online in HTML!

It is complicated to look up sections by their number.  Or am I
missing something?




Re: [I18n]Re: Li18nux Locale Name Guideline Public Review

2002-01-22 Thread Florian Weimer

Bram Moolenaar [EMAIL PROTECTED] writes:

 Ignoring case does not appear to lead to compatibility problems.

It does. Case is used to separate public and private namespace
(probably a design mistake).  However, we should ignore case in the
charset: we are going to use mainly MIME charset names (at least I
hope so), and MIME charset names are case insensitive.

Anyway, the case sensitivity issue is a strawman, IMHO.  If there is a
single, system-wide locale database, the name of a locale becomes much
less an issue (as it should never be transmitted over the wire).
Current experience with GNU libc and XFree86 4.1.x shows that
enriching locale data based on the name simply does not work.




Re: Free availability of ISO/IEC standards

2002-01-04 Thread Florian Weimer

Keld Jørn Simonsen [EMAIL PROTECTED] writes:

 Can't you get access to them in the onsite department of the library?
 (That is, the department where you cannot loan the books, but only
 read them onsite).

No, definitely not.  The librarians don't even know how to get those
standards (ISO and IEEE).

There are *copies* of DIN standards in the university library, but the
archive is far from complete, and you never know (without independent
checking) if you've missed an important TC.

Thanks to modern technology, I can query quite a few library
catalogs simultaneously.  For example, the libraries in Southwest
Germany have got books with "coded character sets" in the title, but
all of them are ECMA standards.

 I gather that I am in a lucky position living in a big city in one
 of the more developed countries of the world, but generally
 universities in all countries at least in the industrialized world
 have systems so a student can get hold of any major technical book
 (this is essential for a university to fulfill its mission) and
 often general public can get access too if they are persistent
 enough.

Of course, you can ask the university to buy the book.  I've been told
this wouldn't be a problem, although you can't use the $18 PDF option
in this case.




Re: unicode in emacs 21

2001-11-04 Thread Florian Weimer

Eli Zaretskii [EMAIL PROTECTED] writes:

 The GNU Emacs/Unicode proposal I've seen seems to have this property,
 too.  (At least the proposal is ambiguous, and one interpretation is
 that you can encode a single character in multiple ways.)

 Unless you refer to the CNS plane and Japanese Han characters, which
 were deliberately left ununified (in addition to the Unicode
 codepoints for those characters), I think you are mistaken.

I hope so. ;-)

 Could you please point out where in the proposal do you see that a
 character can be encoded in multiple ways?

I think now that the surrogate stuff has been explained, the encoding
to UCS-E (Unicode-compatible Character Set for Emacs) is indeed
unambiguous.

However, UTF-E (the buffer encoding) opens possibilities for different
encodings of the same UCS-E code point, but this can be resolved, I
think.




Re: Unicode in Emacs again

2001-11-04 Thread Florian Weimer

Kenichi Handa [EMAIL PROTECTED] writes:

 Florian Weimer [EMAIL PROTECTED] writes:
 What does 'via surrogate pair' mean?  I guess the second line should
 read:

00      Unicode 20bit (U+10000 - U+FFFFF)

 Yes.  That's correct, and the third line should read as below:

01      Unicode 20bit (U+100000 - U+10FFFF)

I'm still not convinced it's correct.  My current understanding is
that it should be:

  00      Unicode 20 bit   (U+000000 - U+0FFFFF)
  01      Unicode 20.08... bit (U+100000 - U+10FFFF)

I'm currently reading the emacs-unicode mailing list, and it seems a
few essential issues weren't on the horizon back then.  Shall I send a
comment to the emacs-unicode mailing list when I'm finished?
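The "20.08... bit" figure can be checked directly.  A small Python sketch (an illustration only, not part of the proposal):

```python
import math

# the 00 prefix covers exactly 20 bits of code space ...
assert 0x0FFFFF - 0x000000 + 1 == 2 ** 20
# ... while the full range U+000000..U+10FFFF is just over 20 bits
# wide, which is where the "20.08... bit" figure comes from
assert abs(math.log2(0x110000) - 20.087) < 0.001
```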




A more verbose version of Emacs-Unicode-990824

2001-11-04 Thread Florian Weimer

Unicode Support for GNU Emacs
*****************************

   This memo documents the current plans for bringing Unicode to GNU
Emacs.  It describes the requirements and constraints for the new Emacs
character set, and the transformation format used to encode this
character set in buffers.

   It reflects the discussion on the `emacs-unicode' mailing list and
the `Emacs-Unicode-990824' proposal.

   Version $Revision: 1.1 $, written by Florian Weimer.

Requirements
============

   The internal character code of a character has to fit in 22 bits.
(The remaining bits of a 32 bit host integer are required for tagging.)

   The representation of characters in buffers and strings has to be
compact.  22 and more bits per ASCII character are not acceptable.

   Latin scripts are unified.

   There are strong reservations regarding Han unification.  Emacs must
be able to display Han characters using a font which matches the
expectations of CJKV users.

   In addition, there are some character sets for which no corresponding
code points have been assigned yet in Unicode.

   The Emacs character set should deviate as little as possible from the
Unicode character set (and similarly, from other included character
sets).  Each deviation has to be documented, and since documentation is
now widely available [Unicode], it does not make sense to rewrite this
documentation from scratch.

   (Up to this point, these requirements were mentioned in previous
discussions on the `emacs-unicode' mailing list.)

   We should assume that UTF-8 [RFC 2279] becomes the dominant
character set on GNU systems.  Users will want to enable it by default.
Therefore, we have to guarantee the following things:

   * Emacs must be able to read any file in UTF-8, even if it contains
 invalid UTF-8 sequences.

   * If a file is read into Emacs and written again without editing,
 the written file must match the original, including possibly broken
 UTF-8 sequences.

   * If the user instructs Emacs to read a file, edits a certain part,
 and writes it back, portions which have not been edited should not
 change in any way (even in the presence of broken UTF-8 sequences).
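The byte-for-byte round trip these points require can be sketched outside Emacs.  For illustration only (this is what Python's "surrogateescape" error handler provides, not how Emacs would implement it):

```python
# Each invalid byte is smuggled through decoding as a lone surrogate
# U+DC80..U+DCFF and restored on encoding, so unedited broken UTF-8
# sequences survive a read/write cycle unchanged.
raw = b"valid \xc3\xa4 then broken \xff\xfe end"
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
assert round_tripped == raw
```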

   On some proprietary platforms, there is a strong trend towards
UTF-16, and similar requirements apply there (with broken surrogate
pairs instead of broken UTF-8 sequences).

Rejected Requirements
=====================

   Latin unification means that it is not possible to read an ISO 2022
encoded file (which might contain several scripts from ISO 8859 unified
in Unicode), and write it back again, so that it matches the original.
In addition, the shape of accents varies from one Latin script to
another, and those accents are unified in Unicode.  This might
introduce slight typographic inaccuracies if the wrong font is chosen,
which seem, however, to be acceptable in a text editor.

Tools Available for Implementation
==================================

   We can achieve the Latin unification either by carefully unifying the
existing MULE charsets, or by switching to Unicode.  Because of other
requirements, in particular documentation, the latter seems to be
desirable.

   There are several approaches for working around Han unification:

   * plane 14 language tags [Plane14] (now an official part of Unicode)

   * text properties

   * separate CJKV character sets (in particular for Korean, Japanese,
 and Vietnamese users; Chinese seems to be not so problematic)

   A language tag in each character is not possible because of the
22 bit limit for a character code.

   Because of the need for a Han unification workaround, straightforward
UCS-4 cannot be used for the Emacs character set.

The Current Proposal
====================

   The GNU Emacs Unicode proposal consists of two parts: a character set,
and an encoding of this character set for use in buffers and strings.

   Basic semantics have not been discussed much yet.

The Emacs Character Set
-----------------------

   The Unicode-compatible Character Set for Emacs (UCS-E) is based on
UCS-4.  In the following, we use the U+ABCDEF notation (where ABCDEF
are hexadecimal digits) to refer to UCS-4 characters, and the E+ABCDEF
notation to refer to characters in UCS-E.

   The character range E+000000 up to E+10FFFF is identical to UCS-4
(U+000000 up to U+10FFFF, 17 planes of 65,536 code points each).  This
is exactly the range which is addressable using surrogate pairs and
UTF-16.

   However, the planes beyond this range are used differently: planes 17
to 23 are reserved for Emacs (E+110000-E+17FFFF), planes 24 to 31 are
intended for private use (E+180000-E+1FFFFF), and planes 32 to 63 are
partly used for encoding CJK characters, partly for private use
characters (E+200000-E+3FFFFF).  This results in the following picture,
with bit masks in the first column:

 00      Unicode U+000000 - U+0FFFFF
 01      Unicode U+100000 - U+10FFFF
 01 0ppp     7 64K planes reserved
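A quick sketch of the plane arithmetic used above (each 64K plane holds 0x10000 code points; plane 16 ends the Unicode range and plane 17 begins the Emacs-reserved area).  Python, for illustration only:

```python
def plane_range(plane):
    # plane p covers code points p*0x10000 .. p*0x10000 + 0xFFFF
    return plane * 0x10000, plane * 0x10000 + 0xFFFF

assert plane_range(0)  == (0x000000, 0x00FFFF)   # the BMP
assert plane_range(16) == (0x100000, 0x10FFFF)   # last Unicode plane
assert plane_range(17) == (0x110000, 0x11FFFF)   # first Emacs-reserved plane
assert plane_range(23)[1] == 0x17FFFF            # end of the reserved area
```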

Re: unicode in emacs 21

2001-10-30 Thread Florian Weimer

Richard Stallman [EMAIL PROTECTED] writes:

 Supporting Unicode superficially while retaining the current internal
 representation raises a number of problems, one of them being that the
 internal representation has several alternatives for the same character
 which correspond to the same code point in Unicode.

The GNU Emacs/Unicode proposal I've seen seems to have this property,
too.  (At least the proposal is ambiguous, and one interpretation is
that you can encode a single character in multiple ways.)



Re: unicode in emacs 21

2001-10-28 Thread Florian Weimer

H. Peter Anvin [EMAIL PROTECTED] writes:

 Does that mean you're painting yourself into a corner, though,
 requiring manual work to integrate the increasingly Unicode-based
 infrastructure support that is becoming available?  Odds are pretty
 good that they are.

I don't think it is a good idea to use operating system Unicode
support.  This would mean that GNU Emacs behaves differently on
different operating systems, depending on the installed locale
descriptions, for example.

OTOH, the character encodings posted earlier to this list are as
incompatible with existing Unicode support as the current emacs-mule
internal encoding.  In effect, just one Emacs-specific internal
encoding is replaced by another.



Re: unicode in emacs 21

2001-10-28 Thread Florian Weimer

Eli Zaretskii [EMAIL PROTECTED] writes:

 Why can't you continue to use the MULE code and just change the
 character sets to reflect certain aspects of Unicode?
 
 The current plan for Unicode was discussed at length 3 years ago, and
 the result was what I described.

Is the discussion archived somewhere, or are there some design
documents which resulted from the discussion?

 I don't think it's wise for us to reopen that discussion again,
 unless you think the UTF-8-based representation is a terribly wrong
 design.

Of course, it's hard to come up with constructive criticism when you
don't know what's already there. ;-)

 So I don't see any reason for the unnamed Unicode people to get
 annoyed by a term they themselves coined.

Me neither, but I got flamed in the past. :-/

 Conceivably, changing the internal representation doesn't mean we need
 to rewrite all of the existing code, just the low-level parts of it
 that deal with code conversions (i.e. subroutines of encoding and
 decoding functions).

I still don't understand the need for such a change.  In theory, the
internal representation of characters should be invisible to the
higher levels.



Re: unicode in emacs 21

2001-10-27 Thread Florian Weimer

Eli Zaretskii [EMAIL PROTECTED] writes:

 Emacs cannot use a pure UTF-8 encoding, since some cultures don't want
 unification, and it was decided that Emacs should not force
 unification on those cultures.

Why can't you continue to use the MULE code and just change the
character sets to reflect certain aspects of Unicode?  One such aspect
is Latin unification, for example.  (The Unicode people get very
annoyed if you talk about unification, source separation rule etc.
in the context of non-Han scripts...)

In a second step, support for normalization, combining characters
etc. would have to be added, but this could be based on the reliable
foundation of the old MULE code.



Re: UTF16 and GCC

2001-08-08 Thread Florian Weimer

[EMAIL PROTECTED] (Kai Henningsen) writes:

  * Do we need a native wide char encoding, too (mostly for Win32 where
  it's UTF-16, but possibly also some Asian thing)?

 A single 'char' encoded in UTF-16?  This sounds horrible.

 I can't quite parse that.

If you've got a 16 bit wchar_t, there's no way that it can store
characters encoded in UTF-16.  What happens to characters outside the
BMP?

A 16 bit wchar_t in C only makes sense in conjunction with UCS-2.  All C
functions working on wide characters can only deal with characters in
the BMP anyway, even if you permit encoding wchar_t * strings in
UTF-16.
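The surrogate-pair arithmetic itself shows the problem: every character above the BMP takes two 16-bit units, which a 16 bit wchar_t cannot hold as one value.  A sketch (Python rather than C, for illustration):

```python
def to_surrogate_pair(cp):
    # UTF-16 encodes U+10000..U+10FFFF as a high surrogate
    # (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF)
    assert 0xFFFF < cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

# U+10400 (outside the BMP) needs two 16-bit units:
assert to_surrogate_pair(0x10400) == (0xD801, 0xDC00)
```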



Re: UTF16 and GCC

2001-08-05 Thread Florian Weimer

[EMAIL PROTECTED] (Kai Henningsen) writes:

 * Do we need a native wide char encoding, too (mostly for Win32 where  
 it's UTF-16, but possibly also some Asian thing)?

A single 'char' encoded in UTF-16?  This sounds horrible.



Re: Word and Antiword

2001-07-14 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 Antiword is available from
 
   http://www.winfield.demon.nl/
 
 and provides significantly better DOC to plaintext conversion
 than any Microsoft product.

Unfortunately, this is not true.  It fails badly on Word documents
with embedded change history, like any other third-party converter
I've tested so far.  This can be quite dangerous because the extracted
plaintext can differ substantially from what a Word user sees on the
screen.



Re: file name encoding

2001-06-27 Thread Florian Weimer

Bruno Haible [EMAIL PROTECTED] writes:

 The programs we are waiting for are:
 
   - emacs. In an UTF-8 locale, it does not set the
 keyboard-coding-system to UTF-8, thus when I type umlaut keys
 strange things happen. And it does not set the default file
 encoding to UTF-8,

I hope so!  Setting the default encoding to UTF-8 for random files is
harmful in the Emacs context, especially with the current fragile
UTF-8 implementation.

 thus I see mojibake every time I open a
 file which looks perfectly nice through cat or vi in xterm.
 But we heard the Emacs developers are working on this lately.

Yes, the specific problems are solved.  It isn't a big deal actually;
apparently no one had actually tried to run Emacs on a multibyte
terminal until, a few months ago, some guy from Germany (not me, BTW)
triggered a general bug in the Emacs keyboard coding system in this
context, which has reportedly been fixed in the development sources.

Anyway, you can run a suitably recent version of Emacs (probably
not the Emacs 21 branch, however) inside a UTF-8 xterm and it
works mainly as expected.  Actually, I've got access to Emacs 20
with MULE-UCS only, and the results are promising indeed.  I didn't
check that the notions of full width characters match and other
sophisticated stuff, but the HELLO file displays quite nicely.



Re: Set Character Width Proposal (Version 3)

2001-06-24 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 Here is another iteration of the SCW control function definition, to
 allow users of terminal emulators full control over whether single-width
 or double-width glyphs will be used:

Why don't you use the Unicode tagging mechanism (or some special
Unicode characters)?  I think this makes sense even in plain text, and
not only when communicating with terminal devices.



Re: wchar_t -- Unicode Conversion

2001-06-02 Thread Florian Weimer

Michael B. Allen [EMAIL PROTECTED] writes:

 Why doesn't wchar_t play nice with Unicode?

It does, if your C implementation defines the macro name
__STDC_ISO_10646__ (see the C standard for additional information).



UTF-8 in RFC 2279 and ISO 10646

2001-05-01 Thread Florian Weimer

Sorry for this question which is slightly off topic:

Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
identical or equivalent?  Can any harm result if a normative document
refers to both definitions?  (This would be a bad idea if the
definitions are slightly different.)

And BTW: Does ISO 10646 define character properties (such as lowercase
letter, uppercase letter, titlecase letter, other letter, decimal
digit, other digit and so on)?
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: REVERSE SOLIDUS in JIS0208.TXT

2001-04-15 Thread Florian Weimer

  Markus Kuhn [EMAIL PROTECTED] writes:

 Note that we have the exact same problem with various European/American
 encodings such as CP437, where IBM and Microsoft came up with radically
 different and incompatible mappings

If I'm not mistaken, at least one character in CP437 has even been
reassigned.  Older graphics hardware and printers interpret 0xE1 as
U+03B2 GREEK SMALL LETTER BETA, and not as U+00DF LATIN SMALL LETTER
SHARP S, which can be quite annoying if you need the latter, because
the glyphs are clearly different.
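The standard table mapping can be checked with any modern codec library; for illustration, Python's cp437 codec follows the IBM/Microsoft tables:

```python
# The standard CP437 tables map 0xE1 to U+00DF LATIN SMALL LETTER
# SHARP S; older display hardware drew the same code as a Greek beta.
assert b"\xe1".decode("cp437") == "\u00df"
assert "\u00df".encode("cp437") == b"\xe1"
```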



Re: Unicode is optimal for Chinese/Japanese multilingual texts

2001-04-14 Thread Florian Weimer

  "H. Peter Anvin" [EMAIL PROTECTED] writes:

  The Chinese Academy Of Sciences has published a set of scalable fonts
  in several styles, but unfortunately in a proprietary format with
  closed-source converters to PK format for usage with TeX.
  
 
 Is there any descriptions of this format?

I didn't find one when I looked for it a few years ago.  Perhaps the
format description is available in Chinese, but I can't read that.

 What kinds of curves does it use?

I'm not sure if it uses curves at all. :-/



Re: Unicode is optimal for Chinese/Japanese multilingual texts

2001-04-11 Thread Florian Weimer

Tomohiro KUBOTA [EMAIL PROTECTED] writes:

 I don't know about Chinese and Korean font projects.

The Chinese Academy Of Sciences has published a set of scalable fonts
in several styles, but unfortunately in a proprietary format with
closed-source converters to PK format for usage with TeX.



Re: Doublewidth Cyrillic for unhappy Japanese people

2001-04-11 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 The only characters for which double-width (square) is appropriate are
 
   - Han ideographs
   - Hiragana/Katakana
   - Hangul
   - CJK punctuation
   - fullwidth forms

There are a few other characters which simply can't be displayed
properly using single-width glyphs, for example:

U+222D TRIPLE INTEGRAL
U+24A8 PARENTHESIZED LATIN SMALL LETTER M
U+FB03 LATIN SMALL LIGATURE FFI
U+FB04 LATIN SMALL LIGATURE FFL
U+2473 CIRCLED NUMBER TWENTY
U+2487 PARENTHESIZED NUMBER TWENTY
U+24DC CIRCLED LATIN SMALL LETTER M



Re: Doublewidth block graphics for unhappy MS-DOS users

2001-04-11 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 CP437/CP850 is still used today in the MS-DOS box on *every* Windows98
 machine in West Europe/US/etc.

These codepages are also used on IBM operating systems such as OS/2
and AIX, I guess.



Re: Doublewidth Cyrillic for unhappy Japanese people

2001-04-11 Thread Florian Weimer

Martin Norbäck [EMAIL PROTECTED] writes:

 I think this is a simple issue of counting the vertical lines in the
 glyph.

I think that's too coarse.  There might be some cases in which existing
monospace fonts treat characters as single-width because systems with
9x16 or 8x8 glyph cells are much more commonly used than 6x13 cells.
In such cases, compatibility should be preserved.

 The Latin ligatures should be double width as well, but who uses them in
 plain text?

I guess people who play with Unicode to upset other people. ;-)

 As for the EM DASH, typographically it should perhaps be double width,
 but we aren't dealing with typography. As long as it's readable, I would
 rather see as few double width characters as possible.

I think it has to be double-width in order to see that it's not an EN
DASH.



Re: multilingual man pages

2001-04-11 Thread Florian Weimer

Bruno Haible [EMAIL PROTECTED] writes:

 Wouldn't it be better to use standard names in all cases, and use a
 simple Emacs lisp function to convert the standard name to an Emacs
 name?  The Emacs PO mode already has code for this.

I think Gnus implements a different, but similar functionality, based
on the value of 'mm-mime-mule-charset-alist' and the 'mime-charset'
coding system attribute.



Re: Doublewidth EM DASH for unhappy English people

2001-04-11 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 I see actually no big problem to make all the circled and parenthesised
 numbers and letters doublewidth in the standard wcwidth, or even the EM
 DASH. It would just mean that the definition of wcwidth becomes an
 actual design issue, and not just like it is at the moment a function
 rather strictly derived from a Unicode database property.

I guess an additional character property is needed for this, although
this is rather a glyph property.  Perhaps some special combining
characters (FORCE DOUBLE WIDTH, FORCE SINGLE WIDTH) could be helpful
as well, at least for communication with terminal emulators.

 I also suspect that Japanese users will not really want to insist on
 doublewidth European letters. The only point of conflict that I see
 are the block graphics characters, as they are used in both
 communities widely with their respective widths.

There are also quite a few scripts which feature a combination of
simple and rather complex glyphs (the latter don't fit well into a
single-width box).  Cyrillic and the Latin Serbo-Croatian
transliteration are examples, and Arabic, Devanagari, and Tibetan are
actually displayed with both single- and double-width glyphs by Emacs.
In addition, there are combining characters which substantially change
the width of the character they are applied to.



Re: TCL/Tk and ISO10646-1 fonts

2001-04-08 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

 It seems that the soon to be released new TCL/Tk 8.3.3 is finally going
 to be able to use *-iso10646-1 fonts directly, thanks to recent patches
 by Jeff Hobbs [EMAIL PROTECTED] and Brent Welch [EMAIL PROTECTED].

BTW, what about their UTF-8 decoder?  Does it still accept overlong
sequences and fall back to ISO-8859-1 if it's unable to decode some
characters?
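A correct decoder must reject overlong forms.  A minimal check (Python shown for illustration, not Tcl/Tk's decoder):

```python
# Overlong two-byte encoding of U+002F '/': accepting it would let
# "../" path filters be bypassed, so a strict decoder must reject it.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
assert "/".encode("utf-8") == b"\x2f"   # the only valid encoding
```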



Re: iconv in glibc

2000-09-30 Thread Florian Weimer

  Bruno Haible [EMAIL PROTECTED] writes:

 Edmund GRIMLEY EVANS asked on 1999-11-25:
 
  Will iconv() in glibc-2.2 convert from utf-7?
 
 Yes. It has been added to glibc-2.2 in order to cope with email
 messages sent out in this encoding by some mailers in East Asia.

I've seen quite a few messages originating in Germany as well.  Some of
the Usenet agents by Microsoft can be (mis)configured to use it, it
seems.



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-25 Thread Florian Weimer

  Edmund GRIMLEY EVANS [EMAIL PROTECTED] writes:

  B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
 
 This is what I do in Mutt. It's easy to implement and works for any
 multibyte encoding; the program doesn't have to know about UTF-8.

This is what I recommend at the moment, with two exceptions: For
UTF-8-to-UTF-16 translation, a UCS-4 character which can't be
represented in UTF-16 is replaced with a single replacement character.
This also applies to syntactically correct UTF-8 sequences which are
either overlong or encode code positions such as surrogates which are
forbidden in UTF-8.
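For illustration, option B above (one U+FFFD per malformed byte) is also what a modern codec library produces with its "replace" policy; e.g. in Python, not Mutt's own code:

```python
# Two invalid lead bytes become two replacement characters;
# the surrounding valid text is untouched.
bad = b"abc\xff\xfe def"
assert bad.decode("utf-8", errors="replace") == "abc\ufffd\ufffd def"
```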

  D) Emit a malformed UTF-16 sequence for every byte in a malformed
 UTF-8 sequence
 
 Not much good if you're not converting to UTF-16.

Well, it works with UCS-4 as well (but I would use a private area for
this kind of stuff until it's generally accepted practice to do such
hacks with surrogates).

I think D) could be yet another translation method (in addition to
"error" and "replace"), but it shouldn't be the only one a UTF-8
library provides.  With method D), your UTF-8 *encoder* might create
an invalid UTF-8 stream, which is certainly not desirable for some
applications.

 It's unfortunate that the current UTF-8 stuff for Emacs causes
 malformed UTF-8 files to be silently trashed.

Yes, that's quite annoying.  But the whole MULE stuff is a big
mess.  In-band signalling everywhere. :-( (Some byte sequences in a
single-byte buffer do very strange things.)