>From [email protected] Mon Jul 27 10:07:03 2009 Received: from mail-fx0-f215.google.com (mail-fx0-f215.google.com [209.85.220.215]) by lib.oat.com (8.14.3/8.14.3) with ESMTP id n6RE6wFn018594 for <[email protected]>; Mon, 27 Jul 2009 10:07:02 -0400 (EDT) Received: by fxm11 with SMTP id 11so2735613fxm.18 for <[email protected]>; Mon, 27 Jul 2009 07:06:52 -0700 (PDT) Received: by 10.204.70.19 with SMTP id b19mr3168590bkj.62.1248703612472; Mon, 27 Jul 2009 07:06:52 -0700 (PDT) Subject: Re: locale en_US.ASCII? To: Geoff <[email protected]> Cc: [email protected]
2009/7/27 Geoff <[email protected]>:
>> Is there a reason the en_US.ASCII locale is not available in
>> the standard distribution?

on Mon, 27 Jul 2009 16:06:52 +0200 ropers <[email protected]> wrote:
> Well, UTF-8 is backwards-compatible with US-ASCII. Or maybe when you
> said "US-ASCII" you were really thinking of ye olde Code Page 437?
> http://en.wikipedia.org/wiki/Code_page_437

No, I'm thinking of 0x20 (space) through 0x7E: upper- and lower-case
A-Z, the digits 0-9, and the US-ASCII punctuation in the ranges
0x21-0x2F and 0x7B-0x7E. That is the common character set for keyboard
input and printable output across all the systems I use. If all my
keyboards, displays, and listing printers used a consistent ISO 8859-1
set (for example), there wouldn't be much of a problem.

I need to be able to compose and edit text-like data for a number of
systems which use various character sets, and to prepare that data in
a consistent manner while logged into an OpenBSD system from those
systems. The printers and other specialized I/O devices producing and
consuming my data map 8-bit codes to glyphs in different ways, mostly
nonstandard or obscure. Worse, many of them have multiple mappings
available.

If I set the locale to "C", most systems define the characters
0x7F-0xFF as not printable. Editors and the like print them in some
escaped or hex representation which is consistent, and I can enter the
values as hex or octal. nvi, for instance, displays such a character as
a backslash followed by two hex digits, and accepts it as input as
control-X followed by two hex digits.

The problem is that UTF-8 is a superset of US-ASCII. Various
implementations differ incompatibly in how they handle the characters
not in US-ASCII, even when supposedly set to the same locale:

  - glyphs are missing from fonts
  - glyphs are inconsistent from system to system
  - characters are printed as (character - 0x80)
  - etc.

Input encoding from keyboards often changes as well.
Since the OpenBSD default "C" locale defines many of the characters
0x7F-0xFF as printable, my local systems attempt to print those
characters in ways that are often not visible or readable.

I understand that many people don't have US-ASCII keyboards or
displays and find the limitation to that character set a problem.
Still, the ability to map character/byte/octet values to and from
visible marks in a completely consistent manner is valuable to me, and
perhaps to other people as well.

> That said, it is my understanding that Unicode support in OpenBSD
> hasn't been completed yet (correct me if I'm wrong).

I don't believe it's there either, since I haven't often seen
wchar-typed data in the sources. That's (selfishly) fine for me:
living in a US-English environment shields me from much of the need
for conversion. Unicode is an extremely unpleasant thing I'm avoiding
until the worth of the outcome exceeds the pain of conversion. For
instance, will the world settle on a compressed form using the escape
convention, or will all data change to 16-bit? Or both? Legacy
hardware, programs, and data have a way of living forever. Translation
to and from Unicode will be with us for a very long time, even when
the common systems and programs are all Unicode-aware. :-( :-( :-(

geoff steckel

