Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Bruno Haible

Markus Kuhn writes:
> > The appearance of U+FFFD is a kind of error message.
>
> Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded
> by a high surrogate) is equally "a kind of error message". Just one that
> contains a bit (well, seven :-) more information.

The difference is that application writers know how to deal with
U+FFFD (hollow box, width 1, etc.). But if a byte 0xBB -> U+DC3B
appears, applications don't know whether it represents an ISO-8859-1
0xBB (angle quotation mark) or an ISO-8859-6 0xBB (Arabic semicolon).
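
To make the mapping concrete (a sketch added for illustration, not
from the original mail; the offset is inferred from the 0xBB -> U+DC3B
example): each malformed byte b in 0x80..0xFF becomes the lone low
surrogate U+DC00 + (b - 0x80), so the byte is recoverable, but its
meaning is not:

    unsigned int escape = 0xDC3B;                  /* decoder saw byte 0xBB */
    unsigned char raw = 0x80 + (escape - 0xDC00);  /* 0xBB again -- but is it
                                                      an ISO-8859-1 angle
                                                      quote or an ISO-8859-6
                                                      Arabic semicolon? */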

> I see valuable binary data (PDF & ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
> data conversion that supposedly smart software tries to perform on the
> fly.

It's a problem of the applications. Some application writers think
that "as many automatic conversions as possible" and "as many
heuristics as possible" qualify as smart. Try and teach them.

> I foresee similar non-recoverable data conversion accidents if we
> try to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

I like the way Emacs deals with the problem of (sometimes necessary)
conversions: When there is an ambiguity, it asks the user. When I take
an ISO-8859-1 file with German umlauts and paste a few Chinese
ideograms into it and then attempt to save it, it warns me that the
new characters won't fit with the existing file encoding and asks me
to choose another file encoding.

Bruno



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Edmund GRIMLEY EVANS

Markus Kuhn [EMAIL PROTECTED]:

> I see valuable binary data (PDF & ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
> data conversion that supposedly smart software tries to perform on the
> fly. I foresee similar non-recoverable data conversion accidents if we
> try to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

Here the problem is that the program is misconverting on the fly and
not giving an error. If the program stopped with an error halfway
through, the user would know there was a problem and be able to do
something about it.

So, I don't think a UTF-8 decoder, as implemented in a library, should
do anything other than give an error if it encounters malformed UTF-8.
The user should be told that something has gone wrong. Clever
reversible conversion of malformed sequences is more likely to hide a
real problem, causing a bigger problem later, than to be helpful, I
suspect.
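
A minimal sketch of this strict policy (an illustration, not Edmund's
code): decode one sequence at a time and report malformed input to the
caller instead of guessing.

    #include <stddef.h>

    /* Returns the number of bytes consumed, or -1 on malformed input. */
    int utf8_decode_one(const unsigned char *s, size_t n, unsigned int *out)
    {
        if (n == 0)
            return -1;
        if (s[0] < 0x80) {                        /* plain ASCII */
            *out = s[0];
            return 1;
        }
        if (s[0] >= 0xC2 && s[0] <= 0xDF) {       /* two bytes, no overlongs */
            if (n < 2 || (s[1] & 0xC0) != 0x80)
                return -1;
            *out = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        /* Longer forms omitted for brevity; lone continuation bytes and
           bad starters end up here. */
        return -1;                                /* caller reports the error,
                                                     like EILSEQ */
    }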

Edmund



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-27 Thread Bruno Haible

Markus Kuhn's proposal D:
> All the previous options for converting malformed UTF-8 sequences to
> UTF-16 destroy information. ...
> Malformed UTF-8 sequences consist exclusively of the bytes 0x80 -
> 0xff, and each of these bytes can be represented using a 16-bit
> value ...
> This way 100% binary transparent UTF-8 -> UTF-16/32 -> UTF-8 round-trip
> compatibility can be achieved quite easily.
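
Read that way, the proposal amounts to a round trip like this (a sketch
added for illustration, using the byte -> U+DC00 + (byte - 0x80) mapping
discussed in this thread; it is not code from the mail):

    /* Decoder side: escape each byte of a malformed sequence. */
    unsigned int escape_byte(unsigned char b)      /* b in 0x80..0xFF */
    {
        return 0xDC00 + (b - 0x80);                /* lone low surrogate */
    }

    /* Encoder side: turn escapes back into the original raw bytes. */
    int unescape_char(unsigned int c, unsigned char *b)
    {
        if (c >= 0xDC00 && c <= 0xDC7F) {          /* escaped raw byte */
            *b = (unsigned char)(0x80 + (c - 0xDC00));
            return 1;                              /* emit *b verbatim */
        }
        return 0;                                  /* ordinary character:
                                                      encode as UTF-8 */
    }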

I don't like this proposal for a few reasons:

* What interoperable and reliable software needs is a clear and
  standardized interchange format. It must say "this is allowed" and
  "that is forbidden". If after a few years a standard starts saying
  "this was forbidden but is now allowed", then older software will
  not accept output from newer programs any more. And the result will
  be just like the mess we had around 1992, when some but not all Unix
  software was 8-bit clean.

* A program which does something halfway intelligent, like the "fmt"
  line breaking program, needs to make assumptions about the
  characters it is treating. (In the case of fmt: recognize spaces and
  newlines, and know about their width.) The input is UTF-8 and is
  converted to UCS-4 via fgetwc. If this UCS-4 stream now contains
  characters which are only substitutes for *unknown* characters, the
  fmt program will never know the width of these. It will thus output
  (again in UTF-8) the original characters, but will not have done the
  correct line breaking.

  In summary, this leads to "garbage in -> garbage out" behaviour of
  programs. Whereas a central point of Unicode is that applications
  definitely know the behaviour of *all* characters.

  I much prefer the "garbage in -> error message" way, because it
  enables the user or sysadmin to fix the problem (read: call recode
  on the data files). The appearance of U+FFFD is a kind of error
  message.

* One of your most prominent arguments for the adoption of UTF-8 is
  that in 99.99% of cases a UTF-8 encoded file can easily be
  distinguished from an ISO-8859-1 encoded one (see the sketch below).
  If UTF-8 were extended so that lone bytes in the range 0x80..0xBF
  were considered valid, this argument would fall apart.
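
A validation check along these lines is what that argument rests on (a
sketch added for illustration, using the modern 1- to 4-byte definition
of UTF-8, with the overlong checks for the 3- and 4-byte forms omitted
for brevity); ISO-8859-1 text with accented letters almost never passes
it:

    #include <stddef.h>

    int looks_like_utf8(const unsigned char *s, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            if (s[i] < 0x80) { i++; continue; }
            int len = (s[i] >= 0xC2 && s[i] <= 0xDF) ? 2
                    : (s[i] >= 0xE0 && s[i] <= 0xEF) ? 3
                    : (s[i] >= 0xF0 && s[i] <= 0xF4) ? 4 : 0;
            if (len == 0 || i + len > n)
                return 0;                    /* bad starter or truncated */
            for (int k = 1; k < len; k++)
                if ((s[i + k] & 0xC0) != 0x80)
                    return 0;                /* not a continuation byte */
            i += len;
        }
        return 1;   /* lone bytes 0x80..0xBF would have failed above */
    }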

Bruno



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-25 Thread Florian Weimer

Edmund GRIMLEY EVANS [EMAIL PROTECTED] writes:

> > B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
>
> This is what I do in Mutt. It's easy to implement and works for any
> multibyte encoding; the program doesn't have to know about UTF-8.

This is what I recommend at the moment, with two exceptions: for
UTF-8-to-UTF-16 translation, a UCS-4 character which can't be
represented in UTF-16 is replaced with a single replacement character.
This also applies to syntactically correct UTF-8 sequences which are
either overlong or encode code positions, such as surrogates, that are
forbidden in UTF-8.
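
A sketch of those extra checks (added for illustration, covering only
the 1- to 4-byte forms; the old 5- and 6-byte forms are left out): after
decoding a syntactically well-formed sequence of len bytes to the UCS-4
value c, reject what must not be passed through.

    static const unsigned int min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };

    int is_acceptable(unsigned int c, int len)     /* len in 1..4 */
    {
        if (c < min_for_len[len])                  /* overlong form */
            return 0;
        if (c >= 0xD800 && c <= 0xDFFF)            /* encoded surrogate */
            return 0;
        if (c > 0x10FFFF)                          /* not representable
                                                      in UTF-16 */
            return 0;
        return 1;   /* caller substitutes one U+FFFD whenever this fails */
    }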

> > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> >    UTF-8 sequence
>
> Not much good if you're not converting to UTF-16.

Well, it works with UCS-4 as well (but I would use a private use area
for this kind of stuff until it's generally accepted practice to do
such hacks with surrogates).

I think D) could be yet another translation method (in addition to
"error" and "replace"), but it shouldn't be the only one a UTF-8
library provides.  With method D), your UTF-8 *encoder* might create
an invalid UTF-8 stream, which is certainly not desirable for some
applications.

> It's unfortunate that the current UTF-8 stuff for Emacs causes
> malformed UTF-8 files to be silently trashed.

Yes, that's quite annoying.  But the whole MULE stuff is a big
mess.  In-band signalling everywhere. :-( (Some byte sequences in a
single-byte buffer do very strange things.)



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-23 Thread Edmund GRIMLEY EVANS

Markus Kuhn [EMAIL PROTECTED]:

> A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting
the phrase "malformed sequence".

I think you probably mean either a single octet in the range 80..BF,
or a single octet in the range FE..FF, or an octet in the range C0..FD
followed by any number of octets in the range 80..BF such that it
isn't correct UTF-8 and isn't followed by another octet in the range
80..BF.

This is probably quite hard to implement consistently, and, as with
semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means
in particular that you can't decode from a fixed-size buffer in the
manner of mbrtowc.

> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any
multibyte encoding; the program doesn't have to know about UTF-8.

But you have to ask yourself: do I reset the mbstate_t when I replace
a bad byte by U+FFFD? If you want consistency, you probably should,
since the mbstate_t is undefined after mbrtowc gives EILSEQ.
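
In code, option B with that reset looks roughly like this (a sketch of
the approach, not Mutt's actual source):

    #include <string.h>
    #include <wchar.h>

    void decode_buffer(const char *s, size_t n)
    {
        mbstate_t st;
        memset(&st, 0, sizeof st);            /* initial conversion state */
        while (n > 0) {
            wchar_t wc;
            size_t k = mbrtowc(&wc, s, n, &st);
            if (k == (size_t)-1) {            /* EILSEQ: malformed input */
                wc = 0xFFFD;                  /* one U+FFFD per bad byte */
                k = 1;
                memset(&st, 0, sizeof st);    /* reset: st is now undefined */
            } else if (k == (size_t)-2) {     /* truncated at end of buffer */
                wc = 0xFFFD;
                k = n;
            } else if (k == 0) {              /* embedded null character */
                k = 1;
            }
            /* ... use wc here ... */
            s += k;
            n -= k;
        }
    }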

> C) Emit a U+FFFD only for every first malformed sequence in a sequence
>    of malformed UTF-8 sequences

I don't think anyone will recommend this.

> D) Emit a malformed UTF-16 sequence for every byte in a malformed
>    UTF-8 sequence

Not much good if you're not converting to UTF-16.

So perhaps B should be the generally recommended way.

However, I agree that a UTF-8 editor should be able to remember
malformed UTF-8 sequences so that you can read in a file, edit part of
it and write it out again without it all being rubbished.

It's unfortunate that the current UTF-8 stuff for Emacs causes
malformed UTF-8 files to be silently trashed.

Edmund