Stewart Stremler wrote:
begin quoting Christopher Smith as of Tue, Oct 25, 2005 at 05:27:38PM -0700:
Stewart Stremler wrote:
[snip]
I don't think the problem is ASCII -- that's the sort of simple mapping
that's capable of being well-defined and standardized.
You're right that the problem isn't ASCII. The problem is that there
isn't really a canonical spelling of Tchaikovsky in ASCII, and
standardizing on one is harder and more problematic than simply using the
Cyrillic representation, particularly when you need to standardize on
all the Tchaikovskys out there.
Even within ASCII, there's more than one way to spell Shakespeare. That
problem isn't really resolved by choosing a glyph-set.
Actually, there is only one canonical spelling of Shakespeare. The rest
are "other ways that people spell it". I'm sure you can find some
historical figure whose name was never spelled out in any kind of a
canonical context. The point remains: in a lot of contexts, being able
to use the native alphabet really does avoid the problem.
And if you have glyphs that look similar in some font, the problem
comes back... and so allowing all glyphs wasn't really a solution anyway.
Huh? No. The problem isn't with end users. The problem is with the
computers. End users can easily recognize that multiple spellings of
Tchaikovsky are talking about the same name, but computers have a real
hard time doing so unless you teach them on a case-by-case basis.
Interestingly, there are cases where you can still have multiple
spellings, even in the native alphabet, in cases where two glyph
sequences, often only slightly different, are interchangeable. Unicode
also deals with this, defining transformations that remove these
differences so that one can programmatically come up with a canonical
representation.
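A quick Python sketch of that canonicalization, using the standard
unicodedata module (the strings here are just illustrative):

```python
import unicodedata

# Two visually identical spellings of "café": one uses the precomposed
# character U+00E9, the other uses 'e' plus a combining acute accent.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)   # False: different code point sequences

# NFC normalization maps both to the same canonical representation.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True
```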
More importantly, I can glance at
i18n and recognize what it represents much more quickly than
internationalization.
If you can do that, then remapping into ASCII should be a simple thing.
Actually remapping into ASCII as you've described does the reverse, as a
relatively short sequence would be transformed into a 4-5x longer
character sequence, which is exactly the issue with internationalization.
Including the use of it: is that microsoft.com with an oh, or some other
glyph that *looks* exactly like an oh?
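A small Python illustration of that homograph problem (the spoofed
string is a made-up example; any Cyrillic/Latin lookalike pair works):

```python
import unicodedata

latin = "microsoft"
spoofed = "micr\u043esoft"   # U+043E CYRILLIC SMALL LETTER O in place of 'o'

# The two strings render identically in most fonts, but compare unequal.
print(latin == spoofed)                  # False
print(unicodedata.name(latin[4]))        # LATIN SMALL LETTER O
print(unicodedata.name(spoofed[4]))      # CYRILLIC SMALL LETTER O
```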
Take the confusion we have with fonts where 1 and l look the same --
that's one of the major issues of Unicode writ small.
Yes, but those problems don't go away in a world with multiple character
sets.
Indeed. But they do go away if there's a default representation in a
non-ambiguous character set.
Not at all. Part of the problem with longer words like
internationalization is that human readers tend to scan the word,
looking primarily at the first and last character and making all kinds
of assumptions about the rest. Indeed, I dropped a letter when I wrote
internationalization the first time in this e-mail, and I bet several
people didn't notice, despite being fully capable of spelling the word
correctly.
No, the longer the words get, the easier life is for phishers.
The real problem wasn't "how do we let everyone have single-character
glyphs", but "how do we let people write in their own language on a
computer".
Once you deal with the problem for a while, you discover that having a
way to represent glyphs as distinct entities (which is what a character
really is) is very much a needed capability in software, and not really
separable from the problem of letting people write in their own language
on a computer.
I assert that (English) words can be considered glyphs (think cursive),
and therefore deserve the same sort of treatment.
One of the nice things is that with English you can work with words or
letters. Your choice. Sometimes people very much want to work with
letters. If nothing else it makes it easier to write spelling correction
software. The funny thing is that despite this choice, a lot of
programmers who deal exclusively with English still choose to work not
with words but with individual symbols. They might be on to something. ;-)
Since we're ready to accept bloat at the outset, a better
approach (to my way of thinking) would be to toss out ANSI, stick with
ASCII, and redefine those ANSI characters as indicators for variable
length strings that should constitute a glyph.
That is pretty similar to what UTF-8 is. The trouble is that the
encoding isn't the entire problem.
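For comparison, a short Python sketch of the property UTF-8 actually
has: ASCII bytes pass through unchanged, while everything else becomes
a multi-byte sequence that can never be mistaken for ASCII:

```python
# ASCII text encodes to the same bytes it would have as plain ASCII.
print("internationalization".encode("utf-8"))
print(len("A".encode("utf-8")))    # 1 byte

# Code points above U+007F become multi-byte sequences; every byte of
# such a sequence has its high bit set, so it never collides with ASCII.
print("é".encode("utf-8"))         # b'\xc3\xa9' (2 bytes)
print("中".encode("utf-8"))        # b'\xe4\xb8\xad' (3 bytes)
```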
UTF8 is *almost* what I want. :)
The good news is that if you have a better encoding, you can petition
for its adoption in the Unicode standard. Heck, you can just start
using it by fiat. Unicode really just enumerates the distinct glyphs and
provides some standard ways of representing them. It really doesn't
require you to use those representations. It just so happens that a lot
of fonts and software work with the standard ones.
You could look at the raw data if you wanted, in an unambiguous format
that would still be readable if it were only a character here or there.
Everyone wins, except for those who will need five-or-more character
strings to represent a glyph.
Yup. It turns out that basically most Asian countries hate UTF-8 because
it makes their characters bigger than local character sets.
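A rough Python illustration of that size penalty, comparing UTF-8
against Shift JIS (one local Japanese encoding) for a sample string:

```python
text = "日本語"   # "Japanese language", three CJK characters

# UTF-8 spends 3 bytes per character here; Shift JIS spends only 2.
print(len(text.encode("utf-8")))       # 9
print(len(text.encode("shift_jis")))   # 6
```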
And when we go to UCS-16 or UCS-32, we'll all hate *that*.
UTF-32 (UCS4) is pretty much bad for everyone, but UTF-16 does work
quite well for certain folks. The Chinese also have their own encoding,
GB18030, that is kind of like their equivalent of UTF-8. Then you have compact
representations like SCSU and others...
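A quick Python comparison of the per-character cost of those encodings
for BMP CJK text (sample string chosen purely for illustration):

```python
text = "中文"   # two CJK characters, both in the Basic Multilingual Plane

print(len(text.encode("utf-16-le")))   # 4 bytes: 2 per character
print(len(text.encode("utf-8")))       # 6 bytes: 3 per character
print(len(text.encode("utf-32-le")))   # 8 bytes: 4 per character, for everyone
```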
Plus, they're still dealing with simplified character sets, so what we
obviously need is UCS-64, right?
No, UTF-8, UTF-16, and UTF-32 (and indeed any unicode encoding) deal
with non-simplified character sets. The "simplified-only" thing pretty
much went away with the notion that Unicode meant fixed-width 16-bit
characters.
(I haven't gone and looked up how big Unicode actually gets...)
Well, it's hard to define what you mean by "big", but the existing glyph
set is pretty exhaustive, at least for "real" languages that have a
written form.
Presumably, we could stop when everyone on the planet gets their own
n-bit character space. That would be fair.
Where "n" is as big as they need... we're pretty much there.
...so we can avoid bloat in our XML documents...
Anyone who wants to avoid bloat doesn't use XML. There are lots of other
cases where you need a comprehensive character set but you still value
compactness.
to feel slighted when they are forced into such things while you don't
see much of a negative impact.
And they're surprised when I feel I'm being forced into such things
because they don't acknowledge a negative impact on me?
There is a difference between a system that screws everyone a little bit
and a system that primarily screws one group of people. It is possible
(though often difficult) to get people to feel okay about the former.
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg