Re: Xterm Unicode Patch #9

2000-07-27 Thread Robert Brady

On Thu, 27 Jul 2000, Mark Leisher wrote:

 
 Robert You know the drill - patch #9 (against xterm #140), now available
 Robert from http://www.zepler.org/~rw197/xterm/
   
 I'm getting a "no such user alias" message from this URL.

Oops, that should read http://www.zepler.org/~rwb197/xterm (note the
added "b" in ~rwb197).

 Also, can someone post the set of links to all the parts needed to get
 this xterm running?  I'm trying to whip up directions for someone else and
 don't want to go through the search for all the parts all over again.

xterm     - ftp://dickey.his.com/xterm/xterm.tar.gz
my patch  - http://www.zepler.org/~rwb197/xterm/xterm-unicode-0.9.diff.gz
ucs fonts - http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.tar.gz
            http://www.cl.cam.ac.uk/~mgk25/ucs-fonts-asian.tar.gz

-- 
Robert


-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-27 Thread Bruno Haible

Markus Kuhn's proposal D:
 All the previous options for converting malformed UTF-8 sequences to
 UTF-16 destroy information. ...
 Malformed UTF-8 sequences consist exclusively of the bytes 0x80 -
 0xff, and each of these bytes can be represented using a 16-bit
 value ...
 This way 100% binary transparent UTF-8 -> UTF-16/32 -> UTF-8 round-trip
 compatibility can be achieved quite easily.
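
For concreteness, a minimal sketch in C of the round-trip idea. The
escape range U+DC80..U+DCFF is an illustrative assumption on my part,
not part of the proposal as quoted; all it requires is some
otherwise-unused 16-bit range:

#include <stdint.h>

/* Decoder side: escape one undecodable byte (0x80..0xFF) as an
   otherwise-unused 16-bit value, e.g. 0x9F -> 0xDC9F. */
uint16_t escape_byte(uint8_t b)
{
    return 0xDC00 | b;
}

/* Encoder side: recover the raw byte, or return -1 for an
   ordinary character that should be encoded as normal UTF-8. */
int unescape_unit(uint16_t u)
{
    if (u >= 0xDC80 && u <= 0xDCFF)
        return u & 0xFF;
    return -1;
}

Re-encoding every escaped unit back to its raw byte reproduces the
original byte stream unchanged, which is the claimed 100% transparency.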

I don't like this proposal for a few reasons:

* What interoperable and reliable software needs is a clear and
  standardized interchange format. It must say "this is allowed" and
  "that is forbidden". If after a few years a standard starts saying
  "this was forbidden but is now allowed", then older software will
  not accept output from newer programs any more. And the result will
  be just like the mess we had around 1992 when some but not all Unix
  software was 8-bit clean.

* A program which does something halfway intelligent, like the "fmt"
  line breaking program, needs to make assumptions about the
  characters it is processing. (In the case of fmt: recognize spaces
  and newlines, and know their width.) The input is UTF-8 and is
  converted to UCS-4 via fgetwc. If this UCS-4 stream now contains
  characters which are only substitutes for *unknown* characters, the
  fmt program will never know their width. It will thus output the
  original characters (again in UTF-8), but will not have done the
  correct line breaking.

  In summary, this leads to "garbage in - garbage out" behaviour of
  programs (the wcwidth() sketch after this list illustrates the
  problem). Whereas a central point of Unicode is that applications
  definitely know the behaviour of *all* characters.

  I much prefer the "garbage in - error message" way, because it
  enables the user or sysadmin to fix the problem (read: call recode
  on the data files). The appearance of U+FFFD is a kind of error
  message.

* One of your most prominent arguments for the adoption of UTF-8 is
  that in 99.99% of cases a UTF-8 encoded file can easily be
  distinguished from an ISO-8859-1 encoded one. If UTF-8 were extended
  so that lone bytes in the range 0x80..0xBF were considered valid,
  this argument would fall apart.
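
Two illustrative sketches for the points above; both are my additions,
not part of the proposal under discussion.

First, the "fmt" problem: layout code leans on wcwidth(), and for a
stand-in character of unknown origin it has nothing to lean on. A
hypothetical column-counting loop:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");              /* e.g. a UTF-8 locale */
    wint_t wc;
    int col = 0;
    while ((wc = fgetwc(stdin)) != WEOF) {
        if (wc == L'\n') { col = 0; continue; }
        int w = wcwidth((wchar_t)wc);
        if (w < 0) {
            /* Width unknown: any line-break decision from here
               on is a guess - garbage in, garbage out. */
            continue;
        }
        col += w;                       /* fmt-style bookkeeping */
    }
    printf("columns on last line: %d\n", col);
    return 0;
}

Second, the detection heuristic: text that validates as UTF-8 is almost
certainly UTF-8 rather than ISO-8859-1. If lone continuation bytes were
declared valid, nearly any ISO-8859-1 text would pass a check like this
one (overlong forms are not rejected here, to keep the sketch short):

#include <stddef.h>
#include <stdint.h>

int looks_like_utf8(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        int len;
        if (b < 0x80)                len = 1;    /* ASCII */
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        else return 0;       /* lone 0x80..0xBF, or 0xF8..0xFF */
        if (i + len > n) return 0;
        for (int j = 1; j < len; j++)
            if ((s[i + j] & 0xC0) != 0x80) return 0;
        i += len;
    }
    return 1;
}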

Bruno



Re: utf-8 encoding scheme

2000-07-27 Thread Henry Spencer

On 21 Jul 2000, H. Peter Anvin wrote:
   One possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
   CHARACTER on encountering illegal sequences.
  Unless you are Bill Gates and have the power to decree that your users
  *will* use your preferred decoder, this may be a mistake.  Remember that
  the users of a decoder see no advantage from this behavior, since they are
  canonicalizing anyway.
 
 Um... not so...
 The user of the decoder is the user that gets bitten by these security
 holes...

Um, no, I think you've missed my point.  The user of a decoder is *not*
going to get bitten by these security holes, because he's *decoding*.  The
act of decoding transforms the input into a form where these holes do not
exist.  The potential for security holes comes when you attempt to use the
raw input, *without* decoding it.  It is the *non-decoding* users who are
vulnerable. 

This being so, decoding users -- who are not vulnerable -- may balk at
having their programs misbehave on inputs which do not threaten them anyway.

 Implicit aliases are very dangerous.

I agree, but the problem is to protect the non-decoding users, and doing
substitutions in decoders may not be the best way to do that. 
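
A concrete example of such an alias (my illustration, not Henry's):
the overlong two-byte sequence 0xC0 0xAF, which a naive decoder maps
to the same U+002F '/' as the plain byte 0x2F. A non-decoding filter
scanning raw bytes for "../" never sees it:

#include <stdio.h>

int main(void)
{
    unsigned char overlong[2] = { 0xC0, 0xAF };   /* overlong '/' */

    /* Naive two-byte decode, with no overlong check:
       ((b0 & 0x1F) << 6) | (b1 & 0x3F)            */
    unsigned cp = ((overlong[0] & 0x1FU) << 6) | (overlong[1] & 0x3FU);

    printf("decodes to U+%04X ('%c')\n", cp, (char)cp);
    return 0;
}

A strict decoder rejects 0xC0 0xAF outright; whether it should instead
substitute something is exactly the question this thread is about.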

  Henry Spencer
   [EMAIL PROTECTED]




Re: C++ STL locale

2000-07-27 Thread Jeu George

In the directory /usr/lib/locale we have different directories that have
  names of different locales. Right?
 
 Yes.
 
In each of these we have some more directories like LC_COLLATE,
  LC_CTYPE etc. Unlike the en_US.UTF-8 dir, where we have all the
  attributes (LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC
  LC_TIME LO_LTYPE), other directories like fr.UTF-8 have only
  LC_MESSAGES. What does this mean?
 
 Once you have done "localedef -i fr -f UTF-8 fr.UTF-8", the fr.UTF-8
 directory should contain the LC_COLLATE etc. files as well.

What does this statement do, exactly?


 
UTF-8 is supposed to cover the characters of all the major languages
  in use in the world today. Then what is the purpose of having
  different locales like de.UTF-8 de.UTF-8@euro en_US.UTF-8 es.UTF-8?
 
 The LC_CTYPE locales of these should all be identical, except for very
 minor differences. The other parts (LC_COLLATE, LC_TIME etc.) reflect
 local cultural habits - sorting order, time display format etc. -
 which are codeset independent.

Now suppose we have made the dir LC_COLLATE using localedef.
I guess all the collation sequences are concerned with this attribute.
But where do we actually specify the collation order? Do we write the
order of characters, as they are supposed to sort, in a file within the
LC_COLLATE dir? If yes, how is this done, i.e. what statement do you
use to make the actual comparison? If no, why are LC_COLLATE, LC_CTYPE
etc. made as directories and not as files?
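
On the comparison itself: the order is never spelled out by the
application. A minimal sketch (the fr.UTF-8 locale name is taken from
this thread and must already have been built with localedef):

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Select the compiled LC_COLLATE data; strcoll() then
       consults it for every comparison. */
    if (!setlocale(LC_COLLATE, "fr.UTF-8"))
        fprintf(stderr, "fr.UTF-8 locale not installed\n");

    const char *a = "cote", *b = "côte";
    int c = strcoll(a, b);
    printf("%s %s %s\n", a, c < 0 ? "<" : c > 0 ? ">" : "==", b);
    return 0;
}

The collation rules live in the binary data under LC_COLLATE, compiled
by localedef from the locale source; programs only ever see them
through strcoll(3)/wcscoll(3) and strxfrm(3).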


 
 The distinction between de.UTF-8 and de.UTF-8@euro, however, does not
 make sense. Maybe these two are identical?
 
 Bruno




UTF-EBCDIC to UTF-8

2000-07-27 Thread Jeu George

Hello,
Is there any conversion routine that transforms UTF-EBCDIC characters
to UTF-8 characters?
Regards
Jeu

