begin quoting Christopher Smith as of Tue, Oct 25, 2005 at 04:03:16PM -0700:
> Gabriel Sechan wrote:
> > Most of the time, you're writing program either for yourself, or your
> > company. Internal apps (at least those I work on) rarely to never go
> > overseas.
> 
> Yeah, that is becoming less and less true these days. Even apps that
> never go overseas have to deal with companies and products from
> overseas, and often representing them in ASCII is highly error prone (as
> indicated by the Tchaikovsky example).

I don't think the problem is ASCII -- that's the sort of simple mapping 
that's capable of being well-defined and standardized.

>                                        Basically, once you are dealing
> with an interchange format where you are considering XML, the
> probabilities really start to skew towards i18n issues (l10n issues
> might not show up unless you specifically have non-local users).
 
i18n and l10n are examples of that simple mapping *within* a language.
(It's not like "internationalization" is hard to spell or anything. Or
type, if you aren't hunting-and-pecking your way around the keyboard.)
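(The abbreviations themselves are exactly that mechanical mapping -- first
letter, a count of what's elided, last letter. A quick sketch, with a
hypothetical `numeronym` helper:)

```python
def numeronym(word):
    """Abbreviate a word as first letter + count of middle letters + last letter."""
    return word[0] + str(len(word) - 2) + word[-1]

print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```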

> > I speak only English.  So worrying about internationalization
> > is a waste of my time.
>
> I speak French and English, but I can assure you the stuff where I need
> to deal with i18n is generally not French. ;-) Basically, if what you
> are doing is tied to the Internet, there is a distinct risk of i18n issues.
 
Including the use of the Internet itself.  Is that microsoft.com with an
oh, or some other glyph that *looks* exactly like an oh?

Take the confusion we have with fonts where 1 and l look the same --
that's one of the major issues of Unicode writ small.
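To make the spoofing point concrete, here's a minimal sketch in Python.
(The choice of the Cyrillic look-alike is my assumption; it's just one of
several glyphs that render identically to a Latin "o" in most fonts.)

```python
# Latin small "o" (U+006F) vs. Cyrillic small "o" (U+043E):
# visually identical in most fonts, but different code points,
# so the two domain strings compare unequal.
latin = "microsoft.com"
spoofed = "micr\u043esoft.com"  # Cyrillic o substituted for Latin o

print(latin == spoofed)          # False
print(ord("o"), ord("\u043e"))   # 111 1086
```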

> > If it turns out it needs to be added later, it's cheaper to do so
> > then, on average.
> 
> For the most part I agree with you. The problem is when it needs to be
> there from the get go, and one fails to recognize this, then all hell
> breaks loose. Also, certain languages (C in particular) can make it much
> harder for you to refactor to a different string representation without
> introducing a lot of difficult-to-identify bugs. So, while going all out
> on i18n might be a lot of wasted effort, making sure you have the right
> abstractions so that you can handle it later is important.

I think the problem is that Unicode tried to solve the wrong problem.
The real problem wasn't "how do we let everyone have single-character
glyphs", but "how do we let people write in their own language on a
computer".  Since we're ready to accept bloat at the outset, a better
approach (to my way of thinking) would be to toss out ANSI, stick with
ASCII, and redefine those ANSI characters as indicators for
variable-length strings that should constitute a glyph.
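(For what it's worth, this is roughly the byte-level trick UTF-8 plays,
whatever one thinks of the Unicode repertoire behind it: bytes below 0x80
are plain ASCII and pass through untouched, while the high-bit range is
reserved for multi-byte sequences. A quick illustration in Python:)

```python
s = "café"
encoded = s.encode("utf-8")
print(encoded)  # b'caf\xc3\xa9'

# The ASCII prefix survives byte-for-byte; only the accented
# character becomes a multi-byte, high-bit sequence.
print(all(b < 0x80 for b in "caf".encode("utf-8")))   # True
print(all(b >= 0x80 for b in "é".encode("utf-8")))    # True
```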

Old software still works -- and given the correct display smarts (e.g.
rewrite printf), works transparently.  You could look at the raw data 
if you wanted, in an unambiguous format that would still be readable if 
it were only a character here or there.  Everyone wins, except for
those who will need five-or-more character strings to represent a glyph.

(But as those languages use a glyph-per-word, more or less, this 
shouldn't be a problem -- nobody was demanding that a sizable subset
of the English dictionary be mapped into Unicode space. Fair's fair.)

[snip]
> I think we're really in agreement on this. I think the issue is
> mandating the use of Unicode in all cases, as opposed to making sure you
> have support for it in standard tools and APIs.

Heh. My issue *is* Unicode.  I believe that Unicode was a solution that 
was arrived at early and all the brainpower was put into making it work 
instead of asking "is this the right thing to do?"  This is often the 
case with smart people, I find... they *can* make it work, so they don't
stop to think about whether it's worth it.

-Stewart

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
