Re: Xterm Unicode Patch #9
On Thu, 27 Jul 2000, Mark Leisher wrote:

> Robert> You know the drill - patch #9 (against xterm #140), now available
> Robert> from http://www.zepler.org/~rw197/xterm/
>
> I'm getting a "no such user alias" message from this URL.

Oops, that should read
http://www.zepler.org/~rwb197/xterm
                         ^

> Also, can someone post the set of links to all the parts needed to get
> this xterm running? I'm trying to whip up directions for someone else
> and don't want to go through the search for all the parts all over
> again.

xterm     - ftp://dickey.his.com/xterm/xterm.tar.gz
my patch  - http://www.zepler.org/~rwb197/xterm/xterm-unicode-0.9.diff.gz
ucs fonts - http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.tar.gz
            http://www.cl.cam.ac.uk/~mgk25/ucs-fonts-asian.tar.gz

--
Robert

-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn's proposal D:

> All the previous options for converting malformed UTF-8 sequences to
> UTF-16 destroy information. ... Malformed UTF-8 sequences consist
> exclusively of the bytes 0x80 - 0xff, and each of these bytes can be
> represented using a 16-bit value ... This way 100% binary transparent
> UTF-8 - UTF-16/32 - UTF-8 round-trip compatibility can be achieved
> quite easily.

I don't like this proposal, for a few reasons:

* What interoperable and reliable software needs is a clear and
  standardized interchange format. It must say "this is allowed" and
  "that is forbidden". If after a few years a standard starts saying
  "this was forbidden but is now allowed", then older software will no
  longer accept output from newer programs. The result will be just
  like the mess we had around 1992, when some but not all Unix software
  was 8-bit clean.

* A program which does something halfway intelligent, like the "fmt"
  line-breaking program, needs to make assumptions about the characters
  it is treating. (In the case of fmt: recognize spaces and newlines,
  and know about their width.) The input is UTF-8 and is converted to
  UCS-4 via fgetwc. If this UCS-4 stream now contains characters which
  are only substitutes for *unknown* characters, the fmt program will
  never know the width of these. It will thus output (again in UTF-8)
  the original characters, but will not have done the correct line
  breaking. In summary, this leads to "garbage in - garbage out"
  behaviour of programs, whereas a central point of Unicode is that
  applications definitely know the behaviour of *all* characters. I
  much prefer the "garbage in - error message" way, because it enables
  the user or sysadmin to fix the problem (read: call recode on the
  data files). The appearance of U+FFFD is a kind of error message.

* One of your most prominent arguments for the adoption of UTF-8 is
  that in 99.99% of cases a UTF-8 encoded file can easily be
  distinguished from an ISO-8859-1 encoded one. If UTF-8 were extended
  so that lone bytes in the range 0x80..0xBF were considered valid,
  this argument would fall apart.

Bruno
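[Editor's note: a minimal sketch of the two decoding strategies under debate, not part of the original thread. It uses Python's codec error handlers: `errors="replace"` is the lossy U+FFFD substitution Bruno defends, and `errors="surrogateescape"` is a later mechanism in the same spirit as proposal D, mapping each malformed byte 0x80-0xFF onto a lone surrogate so the exact bytes round-trip.]

```python
raw = b"caf\xe9"  # ISO-8859-1 "café": the lone 0xE9 byte is malformed UTF-8

# U+FFFD substitution: a visible "kind of error message", but lossy
lossy = raw.decode("utf-8", errors="replace")
assert lossy == "caf\ufffd"  # the original 0xE9 byte is gone for good

# Round-trip preservation in the spirit of proposal D: each malformed
# byte 0x80-0xFF becomes a lone surrogate U+DC80-U+DCFF, so the exact
# original byte sequence can be regenerated on re-encoding
kept = raw.decode("utf-8", errors="surrogateescape")
assert kept == "caf\udce9"
assert kept.encode("utf-8", errors="surrogateescape") == raw
```

This also illustrates Bruno's fmt objection: a program receiving `"caf\udce9"` knows nothing about the width or class of U+DCE9, so it can only pass the garbage through.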
Re: utf-8 encoding scheme
On 21 Jul 2000, H. Peter Anvin wrote:

>>> One possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
>>> CHARACTER on encountering illegal sequences.
>>
>> Unless you are Bill Gates and have the power to decree that your
>> users *will* use your preferred decoder, this may be a mistake.
>> Remember that the users of a decoder see no advantage from this
>> behavior, since they are canonicalizing anyway.
>
> Um... not so... The user of the decoder is the user that gets bitten
> by these security holes...

Um, no, I think you've missed my point. The user of a decoder is *not*
going to get bitten by these security holes, because he's *decoding*.
The act of decoding transforms the input into a form where these holes
do not exist. The potential for security holes comes when you attempt
to use the raw input, *without* decoding it. It is the *non-decoding*
users who are vulnerable.

This being so, decoding users -- who are not vulnerable -- may balk at
having their programs misbehave on inputs which do not threaten them
anyway.

> Implicit aliases are very dangerous.

I agree, but the problem is to protect the non-decoding users, and
doing substitutions in decoders may not be the best way to do that.

Henry Spencer [EMAIL PROTECTED]
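[Editor's note: a small illustration of the "implicit alias" hole discussed above, added for context. The byte pair 0xC0 0xAF is the classic overlong (illegal) encoding of "/": code that scans raw, undecoded bytes for a slash is fooled, while any conforming decoder either rejects the sequence or reduces it to U+FFFD, so decoded text never contains the hidden slash.]

```python
overlong = b"..\xc0\xaf.."  # directory-traversal style payload

# A non-decoding consumer checking raw bytes never sees a "/"
assert b"/" not in overlong

# A strict decoder refuses the alias outright
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# A substituting decoder emits U+FFFD instead -- still no "/" appears,
# which is Spencer's point: decoded users are safe either way
assert "/" not in overlong.decode("utf-8", errors="replace")
```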
Re: C++ STL locale
> In the directory /usr/lib/locale we have different directories that
> have the names of different locales. Right???

Yes.

> In each of these we have some more directories like LC_COLLATE,
> LC_CTYPE etc. Unlike the en_US.UTF-8 dir, where we have all the
> attributes, i.e. LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY
> LC_NUMERIC LC_TIME LO_LTYPE, in other directories like fr.UTF-8 we
> have only LC_MESSAGES. What does this mean???

Once you have done "localedef -i fr -f UTF-8 fr.UTF-8", the fr.UTF-8
directory should contain the LC_COLLATE etc. files as well.

> What does this statement exactly do? UTF-8 is supposed to cover the
> characters of all the major languages that are used in the world
> today. Then what is the purpose of having different locales like
> de.UTF-8 de.UTF-8@euro en_US.UTF-8 es.UTF-8?

The LC_CTYPE locales of these should all be identical, except for very
minor differences. The other parts (LC_COLLATE, LC_TIME etc.) reflect
local cultural habits - sorting order, time display format etc. - which
are codeset independent.

> Now suppose we have made the dir LC_COLLATE using localedef. I guess
> all the collation sequences are concerned with this attribute. But
> where do we actually mention the collation order? Is it that we write
> the order of characters, as they are supposed to be, in a file within
> the LC_COLLATE dir? If yes, how is this done, i.e. what statement do
> you use to make the actual comparison? If no, why are LC_COLLATE,
> LC_CTYPE etc. made as dirs, and not as files?

The distinction between de.UTF-8 and de.UTF-8@euro, however, does not
make sense. Maybe these two are identical?

Bruno
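[Editor's note: a sketch of how a program actually consumes the LC_COLLATE data that localedef compiles, added for illustration. It assumes a French UTF-8 locale is installed (e.g. built with "localedef -i fr_FR -f UTF-8 fr_FR.UTF-8"); if not, it falls back to "C". Applications never read the compiled files directly -- they go through setlocale and strxfrm/strcoll.]

```python
import locale

# Try a French UTF-8 locale for collation; fall back to "C" if none of
# the assumed names is installed on this system.
for name in ("fr_FR.UTF-8", "fr.UTF-8", "C"):
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        break
    except locale.Error:
        continue

words = ["côté", "cote", "coté", "côte"]
# strxfrm builds sort keys from the active LC_COLLATE rules, so the
# resulting order reflects the locale's cultural sorting conventions.
print(sorted(words, key=locale.strxfrm))
```

The comparison itself is done by strcoll (or by sorting on strxfrm keys, as here); the order of characters lives in the locale source that localedef compiles, not in anything the application spells out.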
UTF-EBCDIC to UTF-8
Hello,

Is there any conversion routine that transforms UTF-EBCDIC characters
to UTF-8 characters?

Regards,
Jeu