Stewart Stremler wrote:
begin quoting Christopher Smith as of Tue, Oct 25, 2005 at 05:27:38PM -0700:
Stewart Stremler wrote:
[snip]
I don't think the problem is ASCII -- that's the sort of simple mapping
that's capable of being well-defined and standardized.
You're right that the problem isn't ASCII. The problem is that there
isn't really a canonical spelling of Tchaikovsky in ASCII, and
standardizing on one is harder and more problematic than simply using the
Cyrillic representation, particularly when you need to standardize on
all the Tchaikovskys out there.
Even within ASCII, there's more than one way to spell Shakespeare. That
problem isn't really resolved by choosing a glyph-set.
Actually, there is only one canonical spelling of Shakespeare. The rest
are "other ways that people spell it". I'm sure you can find some
historical figure whose name was never spelled out in any kind of a
canonical context. The point remains: in a lot of contexts, being able
to use the native alphabet really does avoid the problem.
And if you have glyphs that look similar in some font, the problem
comes back... and so allowing all glyphs wasn't really a solution anyway.
Huh? No. The problem isn't with end users. The problem is with the
computers. End users can easily recognize that multiple spellings of
Tchaikovsky are talking about the same name, but computers have a real
hard time doing so unless you teach them on a case-by-case basis.
Interestingly, there are cases where you can still have multiple
spellings, even in the native alphabet, in cases where two glyph
sequences, often only slightly different, are interchangeable. Unicode
also deals with this, defining transformations that remove these
differences so that one can programmatically come up with a canonical
representation.
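A quick Python sketch of that canonicalization, using the standard
unicodedata module (the strings here are just illustrative):

```python
import unicodedata

# Two visually identical spellings of "café": one uses the precomposed
# character U+00E9, the other uses 'e' plus a combining acute accent.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)   # False: different code point sequences

# NFC normalization maps both to the same canonical representation.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True
```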
More importantly, I can glance at
i18n and recognize what it represents much more quickly than
internationalization.
If you can do that, then remapping into ASCII should be a simple thing.
Actually remapping into ASCII as you've described does the reverse, as a
relatively short sequence would be transformed into a 4-5x longer
character sequence, which is exactly the issue with internationalization.
Including the use of it: is that microsoft.com with an oh, or some other
glyph that *looks* exactly like an oh?
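A small Python illustration of that homograph problem (the spoofed
string is a made-up example; any Cyrillic/Latin lookalike pair works):

```python
import unicodedata

latin = "microsoft"
spoofed = "micr\u043esoft"   # U+043E CYRILLIC SMALL LETTER O in place of 'o'

# The two strings render identically in most fonts, but compare unequal.
print(latin == spoofed)                  # False
print(unicodedata.name(latin[4]))        # LATIN SMALL LETTER O
print(unicodedata.name(spoofed[4]))      # CYRILLIC SMALL LETTER O
```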
Take the confusion we have with fonts where 1 and l look the same --
that's one of the major issues of Unicode writ small.
Yes, but those problems don't go away in a world with multiple character
sets.
Indeed. But they do go away if there's a default representation in a
non-ambiguous character set.
Not at all. Part of the problem with longer words like
internationalization is that human readers tend to scan the word,
looking primarily at the first and last character and making all kinds
of assumptions about the rest. Indeed, I dropped a letter when I wrote
internationalization the first time in this e-mail, and I bet several
people didn't notice, despite being fully capable of spelling the word
correctly.
No, the longer the words get, the easier life is for phishers.
The real problem wasn't "how do we let everyone have single-character
glyphs", but "how do we let people write in their own language on a
computer".
Once you deal with the problem for a while, you discover that having a
way to represent glyphs as distinct entities (which is what a character
really is) is very much a needed capability in software, and not really
separable from the problem of letting people write in their own language
on a computer.
I assert that (English) words can be considered glyphs (think cursive),
and therefore deserve the same sort of treatment.
One of the nice things is that with English you can work with words or
letters. Your choice. Sometimes people very much want to work with
letters. If nothing else it makes it easier to write spelling correction
software. The funny thing is that despite this choice, a lot of
programmers who deal exclusively with English still choose to work not
with words but with individual symbols. They might be on to something. ;-)
Since we're ready to accept bloat at the outset, a better
approach (to my way of thinking) would be to toss out ANSI, stick with
ASCII, and redefine those ANSI characters as indicators for variable
length strings that should constitute a glyph.
That is pretty similar to what UTF-8 is. The trouble is that the
encoding isn't the entire problem.
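For comparison, a short Python sketch of the property UTF-8 actually
has: ASCII bytes pass through unchanged, while everything else becomes
a multi-byte sequence that can never be mistaken for ASCII:

```python
# ASCII text encodes to the same bytes it would have as plain ASCII.
print("internationalization".encode("utf-8"))
print(len("A".encode("utf-8")))    # 1 byte

# Code points above U+007F become multi-byte sequences; every byte of
# such a sequence has its high bit set, so it never collides with ASCII.
print("é".encode("utf-8"))         # b'\xc3\xa9' (2 bytes)
print("中".encode("utf-8"))        # b'\xe4\xb8\xad' (3 bytes)
```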
UTF8 is *almost* what I want. :)
The good news is that if you have a better encoding, you can petition
for its adoption in the Unicode standard. Heck, you can just start
using it by fiat. Unicode really just enumerates the distinct glyphs and
provides some standard ways of representing them. It really doesn't
require you to use those representations. It just so happens that a lot
of fonts and software work with the standard ones.
You could look at the raw data if you wanted, in an unambiguous format
that would still be readable if it were only a character here or there.
Everyone wins, except for those who will need five-or-more character
strings to represent a glyph.
Yup. It turns out that basically most Asian countries hate UTF-8 because
it makes their characters bigger than local character sets.
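A rough Python illustration of that size penalty, comparing UTF-8
against Shift JIS (one local Japanese encoding) for a sample string:

```python
text = "日本語"   # "Japanese language", three CJK characters

# UTF-8 spends 3 bytes per character here; Shift JIS spends only 2.
print(len(text.encode("utf-8")))       # 9
print(len(text.encode("shift_jis")))   # 6
```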
And when we go to UCS-16 or UCS-32, we'll all hate *that*.
UTF-32 (UCS4) is pretty much bad for everyone, but UTF-16 does work
quite well for certain folks. The Chinese also have their own encoding,
GB18030, that is kind of like their equivalent of UTF-8. Then you have compact
representations like SCSU and others...
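A quick Python comparison of the per-character cost of those encodings
for BMP CJK text (sample string chosen purely for illustration):

```python
text = "中文"   # two CJK characters, both in the Basic Multilingual Plane

print(len(text.encode("utf-16-le")))   # 4 bytes: 2 per character
print(len(text.encode("utf-8")))       # 6 bytes: 3 per character
print(len(text.encode("utf-32-le")))   # 8 bytes: 4 per character, for everyone
```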
Plus, they're still dealing with simplified character sets, so what we
obviously need is UCS-64, right?
No, UTF-8, UTF-16, and UTF-32 (and indeed any unicode encoding) deal
with non-simplified character sets. The "simplified-only" thing pretty
much went away with the notion that Unicode meant fixed-width 16-bit
characters.
(I haven't gone and looked up how big Unicode actually gets...)
Well, it's hard to define what you mean by "big", but the existing glyph
set is pretty exhaustive, at least for "real" languages that have a
written form.
Presumably, we could stop when everyone on the planet gets their own
n-bit character space. That would be fair.
Where "n" is as big as they need... we're pretty much there.
...so we can avoid bloat in our XML documents...
Anyone who wants to avoid bloat doesn't use XML. There are lots of other
cases where you need a comprehensive character set but you still value
compactness.
to feel slighted when they are forced into such things while you don't
see much of a negative impact.
And they're surprised when I feel I'm being forced into such things
because they don't acknowledge a negative impact on me?
There is a difference between a system that screws everyone a little bit
and a system that primarily screws one group of people. It is possible
(though often difficult) to get people to feel okay about the former.
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg