begin quoting Christopher Smith as of Wed, Oct 26, 2005 at 01:54:07AM -0700:
> Stewart Stremler wrote:
> >Even within ASCII, there's more than one way to spell Shakespeare. That
> >problem isn't really resolved by choosing a glyph-set.
>
> Actually, there is only one canonical spelling of Shakespeare. The rest are 

There are two canonical spellings according to various English teachers
I had in high school -- one set said "Shakespear" and the other
"Shakespeare".

> "other ways that people spell it". I'm sure you can find some historical 
> figure whose name was never spelled out in any kind of a canonical 
> context.

Or didn't spell it consistently themselves.

>          The point remains that in a lot of contexts, being able to use 
> the native alphabet really does avoid the problem.

Sure. "Do it my way and hang everyone else" solves a lot of problems. If
we're going to have that, I'd just as soon the way be my way.

> >And if you have glyphs that look similiar in some font, the problem
> >comes back... and so allowing all glyphs wasn't really a solution anyway.
>
> Huh? No. The problem isn't with end users. The problem is with the 
> computers. End users can easily recognize that multiple spellings of 
> Tchaikovsky are talking about the same name, but computers have a real 
> hard time doing so unless you teach them on a case-by-case basis.

Exactly. When you have similar glyphs, the computer *still* can't tell,
and the user can.  Consider those fonts where l == 1 -- when the user
sees schoo1, he reads 'school', but the computer says "nope, different".
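A two-line sketch of that point (hypothetical strings, in Python):

```python
# To a human, "schoo1" reads as "school"; to a byte-comparing
# computer the two strings are simply different.
a = "school"
b = "schoo1"  # trailing lowercase L replaced by the digit one
print(a == b)        # False: the computer says "nope, different"
print(a[-1], b[-1])  # l 1 -- identical in some fonts, distinct code points
```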

If you're looking for standardized spelling, immensely increasing the
number of symbols and including symbols that closely resemble other
symbols is *not* the way to solve the problem.

Any solution that introduces a flaw that causes the original problem
to re-emerge isn't a very good solution.
 
> Interestingly, there are cases where you can still have multiple 
> spellings, even in the native alphabet, in cases where two glyph 
> sequences, often only slightly different, are interchangeable. Unicode 
> also deals with this, allowing transformations that remove these 
> differences, allowing one to programmatically come up with a canonical 
> representation.

And now you're changing the data behind the user's back. BAD PROGRAMMER!
What if they *meant* "schoo1"?
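For reference, the transformation being described is Unicode normalization; a minimal Python sketch of what it does -- and doesn't -- touch:

```python
import unicodedata

# "é" can be one code point (U+00E9) or "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"
print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: canonically equal

# Normalization only merges *canonically* equivalent sequences;
# it leaves "schoo1" alone, digit and all.
print(unicodedata.normalize("NFKC", "schoo1"))  # schoo1
```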
 
> >If you can do that, then remapping into ASCII should be a simple thing.
> >
> Actually remapping into ASCII as you've described does the reverse, as a 
> relatively short sequence would be transformed into a 4-5x longer 
> character sequence, which is exactly the issue with internatonalization.

I was thinking about correctness.  You can remap from one character
sequence into another.  Length... well, I already find i18n stupid
on account of length (like the Y2K problem)... but then, I use 
abbreviations like Y2K. 

Hmph. :-/

Perhaps it has something to do with the first time I saw i18n and l10n
it wasn't in a very good font, and I read them as "il8n" and "llOn" and
they made no sense whatsoever.

> >Indeed. But they do go away if there's a default representation in a
> >non-ambiguous character set.
>
> Not at all. Part of the problem with longer words like 
> internationalization is that human readers tend to scan the word, 
> looking primarily at the first and last character and making all kinds 
> of assumptions about the rest. Indeed, I dropped a letter when I wrote 
> internationalization the first time in this e-mail, and I bet several 
> people didn't notice, despite being fully capable of spelling the word 
> correctly.
> 
> No, the longer the words get, the easier life is for phishers.

The risk isn't with casual words that can be inferred from context,
but with URLs that a user is instructed to go to and can't easily
check.

The advice currently given is "don't click on a link unless you've
verified that where it goes is actually where you want it to go".
So users learn to check and don't click on a link that says
http://bank0famerica.com -- but if we have the joy of unicode,
you can pick a glyph that isn't so easily distinguished.
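A concrete homograph sketch in Python (hypothetical domain; the Cyrillic lookalike letter is the whole trick):

```python
# Cyrillic "а" (U+0430) is visually identical to Latin "a" in most fonts.
latin = "bank"
spoofed = "b\u0430nk"  # second letter is Cyrillic
print(latin == spoofed)  # False, though the two render alike
# The IDNA/punycode form exposes the trick: the spoofed label
# encodes to an "xn--" ASCII string a careful user could notice.
print(spoofed.encode("idna"))
```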

The solution "make the user type in the url" doesn't work, as it's
not acceptable to users -- and they may make a lot of mistakes.
(Plus you run the risk of typosquatters.)

The solution "use certificates" isn't _that_ useful as the site might
present a legitimate certificate.  You have to verify that the
certificate is _for_ who you think it is.  How do you do that? You
LOOK at it.

> >I assert that (english) words can be considered glyphs (think cursive), 
> >and therefore deserve the same sort of treatment.
>
> One of the nice things is that with English you can work with words or 
> letters. Your choice.

Yes. Words are variable-length multi-byte character sequences. They
are not efficient representations.

Why do China and Taiwan and Japan need efficient representations
for words?

>                       Sometimes people very much want to work with 
> letters. If nothing else it makes it easier to write spelling correction 
> software. The funny thing is that despite this choice, a lot of 
> programmers who deal exclusively with English still choose not to work 
> with the words but symbols. They might be on to something. ;-)

Yes. So why should someone else's words map into the character space? A
character is a *component* of a word.

> >UTF8 is *almost* what I want.  :)
>
> The good news is that if you have a better encoding, you can petition 
> for it's adoption in the Unicode standard. Heck, you can just start 

Heh. Right. Nobody takes me seriously *here*, and you think someone
who's made a career out of unicode is going to take a suggestion to
scrap the whole thing and start over?

> using it by fiat. Unicode really just enumerates the distinct glyphs and 
> provides some standard ways of representing them. It really doesn't 
> require you to use those representations. It just so happens that a lot 
> of fonts and software work with the standard ones.

Changing the world by fiat has never worked for me.  And presumably,
the encoding *I* want would be entirely different from unicode, as the
character sequences would best be given mnemonics where possible, and
that is a huge task, on the order of devising unicode itself.

(If you look at the ASCII encoding, a lot of work went in to making
it *sensible*. It's not a simple enumeration of the available glyphs.)
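A couple of examples of that deliberate layout, checkable in Python:

```python
# ASCII's layout was engineered, not just enumerated:
print(ord("A") ^ ord("a"))   # 32 -- upper and lower case differ by one bit
print(chr(ord("Q") | 0x20))  # q  -- so case-folding is a single OR
print(ord("7") - ord("0"))   # 7  -- digits sit contiguously from 0x30
```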
 
Basically, I've lost before I got to the gate, and I resent being told
that it's for my own good and that I should just suffer in silence.
Bugger that for a piece of cheese.

> >And when we go to UCS-16 or UCS-32, we'll all hate *that*.
>
> UTF-32 (UCS4) is pretty much bad for everyone, but UTF-16 does work 
> quite well for certain folks. The Chinese also have their own encoding 
> that is kind of like their equivalent of UTF-8. Then you have compact 
> representations like SCSU and others...

Yah, it's a mess. :(
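The trade-off behind that mess is easy to see by counting bytes (Python sketch; "国际化" is Chinese for "internationalization"):

```python
# Same text, different Unicode encodings, very different byte costs.
for text in ("internationalization", "国际化"):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(repr(text), enc, len(text.encode(enc)), "bytes")
# ASCII-range text: 20 bytes in UTF-8, 40 in UTF-16, 80 in UTF-32.
# The 3-character Chinese word: 9 bytes in UTF-8 but only 6 in UTF-16.
```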

> >Plus, they're still dealing with simplified character sets, so what we 
> >obviously need is UCS-64, right?
>
> No, UTF-8, UTF-16, and UTF-32 (and indeed any unicode encoding) deal 
> with non-simplified character sets. The "simplified-only" thing pretty 
> much went away with the notion that Unicode meant fixed-width 16-bit 
> characters.

So they went back and included all those family names, rarely used
characters, and historical characters?

That is news!
 
> >(I haven't gone and looked up how big Unicode actually gets...)
>
> Well, it's hard to define what you mean by "big", but the existing glyph 
> set is pretty exhaustive, at least for "real" languages who have a 
> written form.
[snip]
> Where "n" is as big as they need... we're pretty much there.

Unicode dropped a lot of "less frequently used" symbols, at least
that was the way it was last time this conversation went around and
I spent a fair bit of time reading up on the pro/con unicode arguments.
It's exhausting, but not really exhaustive...

2^64 gives us all permutations of an 8x8 array of pixels. Let's just
declare 64 bits the new wordsize, and all get modern at the same time.
64-bit addressable machines? anyone?  Let's just be fair about it.
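The 8x8 quip is literal arithmetic: 64 pixels means 2^64 black-and-white bitmaps, one per 64-bit word. A toy Python rendering (the pattern value is just an example):

```python
def bitmap(n):
    """Render a 64-bit integer as an 8x8 black-and-white bitmap."""
    return ["".join("#" if n >> (row * 8 + col) & 1 else "."
                    for col in range(8))
            for row in range(8)]

assert 8 * 8 == 64  # so 2**64 distinct bitmaps, one per 64-bit word
for line in bitmap(0x8142241818244281):  # this value happens to draw an X
    print(line)
```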

(Then we'll have the Chinese complaining that the Americans are abusing
the system by using an 8-symbols-per-character encoding and generating
nonsense characters.)

> >...so we can avoid bloat in our XML documents...
>
> Anyone who wants to avoid bloat doesn't use XML. There are lots of other 
> cases where you need a comprehensive character set but you still value 
> compactness.

Yes. And thus, to be fair, unicode should have included the English
lexicon -- or the root words plus suffixes and prefixes -- in the
spec, just like what everyone else who demands it gets.  It's an
outrage that our text documents are so big.

Don't get me wrong, I see that it's a hard problem.  I just don't think
the *approach* (let's make every glyph a character and simply assign a
number to each) is the right one.

> >And they're surprised when I feel I'm being forced into such things
> >because they don't acknowledge a negative impact on me?
>
> There is a difference between a system that screws everyone a little bit 
> and a system that primarily screws one group of people. It is possible 
> (though often difficult) to get people to feel okay about the former.

However, if you cause pain to everyone so a subset can get a benefit
that the rest do without... chances are, you're going to cause
resentment.

We spell out words. Why shouldn't everyone else?

Or if others don't have to spell their words, why should we?

-Stewart

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
