Re: Substituting malformed UTF-8 sequences in a decoder

David Starner Thu, 27 Jul 2000 15:00:02 -0700

On Thu, Jul 27, 2000 at 11:45:59PM +0200, Bruno Haible wrote:
>   converted to UCS-4 via fgetwc. If this UCS-4 stream now contains
>   characters which are only substitutes for *unknown* characters, the
>   fmt program will never know the width of these. It will thus output
>   (again in UTF-8) the original characters, but will not have done the
>   correct line breaking.
> 
>   In summary, this leads to "garbage in - garbage out" behaviour of
>   programs. Whereas a central point of Unicode is that applications
>   know the behaviour of *all* characters, definitely.

Okay, what's the width of U+F001? U+1EFA? The first is private use, and
the second hasn't been defined yet, but quite possibly will be. The Unicode
standard at the least encourages Unicode-conformant programs to deal with
unknown characters, requiring it in some cases (C10). Unicode is an open
repertoire. Applications don't necessarily know the behavior of all
characters.
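
To make that concrete, here is a rough sketch in Python of how a
fmt-like program might cope with code points it has no data for. It is
not taken from any real fmt implementation; the names classify and
display_width and the fallback width of 1 are my own assumptions. What
it reports for a code point such as U+1EFA depends on the version of
the Unicode database shipped with the interpreter, which is rather the
point.

```python
import unicodedata


def classify(ch):
    """Sort a character into private-use, unassigned, or assigned.

    Based on the General_Category known to this interpreter's Unicode
    database, so the answer can change between Unicode versions.
    """
    category = unicodedata.category(ch)
    if category == "Co":
        return "private-use"
    if category == "Cn":
        return "unassigned"
    return "assigned"


def display_width(ch, fallback=1):
    """Guess a column width for ch (hypothetical helper, rough heuristic).

    Characters the database knows nothing useful about get `fallback`
    instead of causing an error, so the text can still be passed through.
    Control characters are not treated specially in this sketch.
    """
    if classify(ch) != "assigned":
        return fallback  # assumption: treat unknown characters as one column
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # non-spacing and enclosing marks take no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth East Asian characters
    return 1


if __name__ == "__main__":
    for cp in (0xF001, 0x1EFA, 0x0041):
        ch = chr(cp)
        print("U+%04X" % cp, classify(ch), display_width(ch))
```

The line breaking may be off for the unknown characters, but the
program keeps working and the original characters survive the round
trip.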

-- 
David Starner - [EMAIL PROTECTED]
http/ftp: x8b4e53cd.dhcp.okstate.edu
It was starting to rain on the night that they cried forever,
It was blinding with snow on the night that they screamed goodbye.
        - Dio, "Rock and Roll Children"
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
