Stewart Stremler wrote:
> begin quoting Christopher Smith as of Wed, Oct 26, 2005 at 01:54:07AM -0700:
>>Stewart Stremler wrote:
>>>Even within ASCII, there's more than one way to spell Shakespeare. That
>>>problem isn't really resolved by choosing a glyph-set.
>>
>>Actually, there is only one canonical spelling of Shakespeare. The rest are 
> 
> There are two canonical spellings according to various English teachers
> I had in high school -- one set said "Shakespear" and the other
> "Shakespeare".

One of your teachers was wrong. ;-)

>>"other ways that people spell it". I'm sure you can find some historical 
>>figure whose name was never spelled out in any kind of a canonical 
>>context.
> 
> Or didn't spell it consistently themselves.

Actually, even if they get it wrong sometimes, there are usually legal
documents which can clarify things. Still, there are exceptions.

>>         The point remains that in a lot of contexts, being able to use 
>>the native alphabet really does avoid the problem.
> 
> Sure. "Do it my way and hang everyone else" solves a lot of problems. If
> we're going to have that, I'd just as soon the way be my way.

That approach tends to be particularly offensive. Standards like Unicode
allow one to say, "hey, I'm not imposing my perspective on you, I'm
going with a standard." If you don't have a sense as to why this is
important, you've never used software written in Japan by a programmer
employing a similar attitude. ;-)

>>>And if you have glyphs that look similar in some font, the problem
>>>comes back... and so allowing all glyphs wasn't really a solution anyway.
>>
>>Huh? No. The problem isn't with end users. The problem is with the 
>>computers. End users can easily recognize that multiple spellings of 
>>Tchaikovsky are talking about the same name, but computers have a real 
>>hard time doing so unless you teach them on a case-by-case basis.
> 
> Exactly. When you have similar glyphs, the computer *still* can't tell,
> and the user can. Consider those fonts where l == 1 -- when the user
> sees schoo1, he reads 'school', but the computer says "nope, different".

That is a problem with the font, not with the character set. Indeed,
it's a problem often encountered with ASCII. Perhaps the correct
solution is to apply your principles to English characters too. Nobody
would confuse '&el' with '&one'. ;-)

> If you're looking for standardized spelling, immensely increasing the
> number of symbols and including symbols that closely resemble other
> symbols is *not* the way to solve the problem.

It's way the heck better than having a small set of symbols which often
don't have any direct correspondence with the correct set of symbols.

> Any solution that introduces a flaw that causes the original problem
> to re-emerge isn't a very good solution.

I think you're seeing a different flaw than what I'm talking about.

>>Interestingly, there are cases where you can still have multiple 
>>spellings, even in the native alphabet, in cases where two glyph 
>>sequences, often only slightly different, are interchangeable. Unicode 
>>also deals with this, allowing transformations that remove these 
>>differences, allowing one to programmatically come up with a canonical 
>>representation.
> 
> And now you're changing the data behind the user's back. BAD PROGRAMMER!
> What if they  *meant* "schoo1" ?

Huh? Those aren't the kinds of transformations I'm talking about, and
you aren't changing the data -- you use the transformations to create a
unique key. Don't think "schoo1" vs. "school". I'm talking about
canonical equivalence. Think in the context of collation.
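For instance, here's a minimal Python sketch (the accented-e pair is just an illustrative example of canonical equivalence):

```python
import unicodedata

# Two canonically equivalent spellings of the same word: one uses
# precomposed U+00E9, the other uses 'e' plus combining acute (U+0301).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw code point sequences differ...
print(precomposed == decomposed)   # False

# ...but normalizing both to NFC yields one canonical representation,
# suitable for use as a unique key for collation or lookup. The original
# data is untouched; only the key is derived.
key1 = unicodedata.normalize("NFC", precomposed)
key2 = unicodedata.normalize("NFC", decomposed)
print(key1 == key2)                # True
```

The user's text stays exactly as they typed it; the normalized form only ever exists as the comparison key.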

>>>Indeed. But they do go away if there's a default representation in a
>>>non-ambiguous character set.
>>
>>Not at all. Part of the problem with longer words like 
>>internationalization is that human readers tend to scan the word, 
>>looking primarily at the first and last character and making all kinds 
>>of assumptions about the rest. Indeed, I dropped a letter when I wrote 
>>internationalization the first time in this e-mail, and I bet several 
>>people didn't notice, despite being fully capable of spelling the word 
>>correctly.
>>
>>No, the longer the words get, the easier life is for phishers.
> 
> The risk isn't that casual words that can be inferred from context,
> but rather that URLs that a user is instructed to go to can't be 
> checked.

More accurately, *won't* be checked. No such thing as a URL that *can't*
be checked. The issue is one of making them look visually similar enough
that someone just goes ahead and clicks without thinking. It turns
out that really long words with single character differences are at
least as "similar looking" as really short words with a similar looking
character. I seem to recall a rather powerful demonstration of this
where the order of all the characters except the first and last
character in each word was scrambled, and the resultant text was
amazingly readable. Sure, you could easily tell they weren't spelled
right, but imagine now if only two of the letters were interchanged.
Better still if it's displayed in a sans serif font that you aren't
used to.
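That demonstration is easy to reproduce. A rough Python sketch (the scrambling rule is my reconstruction of the demonstration, not a reference implementation):

```python
import random

def scramble_word(word: str) -> str:
    """Shuffle a word's interior letters, keeping the first and last fixed."""
    if len(word) <= 3:
        return word  # nothing meaningful to shuffle
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

text = "internationalization remains surprisingly readable"
print(" ".join(scramble_word(w) for w in text.split()))
```

Most readers can still skim the output without much trouble, which is exactly why a one- or two-letter swap buried in a long hostname goes unnoticed.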

>>>I assert that (english) words can be considered glyphs (think cursive), 
>>>and therefore deserve the same sort of treatment.
>>
>>One of the nice things is that with English you can work with words or 
>>letters. Your choice.
> 
> Yes. Words are variable-length multi-byte character sequences. They
> are not efficient representations.
> 
> Why do China and Taiwan and Japan need efficient representations
> for words?

You think China's approach is really more efficient than English?

>>                      Sometimes people very much want to work with 
>>letters. If nothing else it makes it easier to write spelling correction 
>>software. The funny thing is that despite this choice, a lot of 
>>programmers who deal exclusively with English still choose not to work 
>>with the words but symbols. They might be on to something. ;-)
> 
> Yes. So why should someone else's words map into the character space? A
> character is a *component* of a word.

Characters are the atoms of written language. You can try to make a case
for things being broken down by brush or pencil strokes, but
fundamentally humanity seems to like to look at things on a character by
character basis. People have done what you're describing, and it turns
out that it really, really doesn't work.

>>using it by fiat. Unicode really just enumerates the distinct glyphs and 
>>provides some standard ways of representing them. It really doesn't 
>>require you to use those representations. It just so happens that a lot 
>>of fonts and software work with the standard ones.
> 
> Changing the world by fiat has never worked for me.  And presumably,
> the encoding *I* want would be entirely different from unicode, as the
> character sequences would best be given mnemonics where possible, and
> that is a huge task, on order of devising unicode itself.

That's because Unicode is what you are describing. They just went with a
different encoding than you did. Perhaps you don't like their encoding
choices, but it's pretty hard to get an encoding choice that makes
everyone happy.

I'm sorry, but you keep getting mad about Unicode and then describing
solutions that are so strikingly similar that it seems to me you ought
to consider the possibility that there are some additional factors you
haven't considered that make the minor variations a good idea.

>>>Plus, they're still dealing with simplified character sets, so what we 
>>>obviously need is UCS-64, right?
>>
>>No, UTF-8, UTF-16, and UTF-32 (and indeed any unicode encoding) deal 
>>with non-simplified character sets. The "simplified-only" thing pretty 
>>much went away with the notion that Unicode meant fixed-width 16-bit 
>>characters.
> 
> So they went back and included all those family names, rarely used
> characters, and historical characters?
> 
> That is news!

I don't know about family names, but certainly they worked on the
historical stuff. It's amazing how hard it is to search a collection of
historical documents for historical phrases when you don't have
historical characters. ;-)
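This is easy to check from Python (using Linear B, one of the historical scripts Unicode added outside the original 16-bit range, as an assumed example):

```python
# U+10000 LINEAR B SYLLABLE B008 A -- a historical character outside the
# Basic Multilingual Plane, so no fixed-width 16-bit scheme can hold it.
ch = "\U00010000"

print(ch.encode("utf-8").hex())      # f0908080 (four bytes)
print(ch.encode("utf-16-be").hex())  # d800dc00 (a surrogate pair, not one 16-bit unit)
print(ch.encode("utf-32-be").hex())  # 00010000 (one fixed-width 32-bit unit)
```

All three encodings round-trip the same character; the "16-bit fixed width" idea is what got dropped, not the characters.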

>>Where "n" is as big as they need... we're pretty much there.
> 
> Unicode dropped a lot of "less frequently used" symbols, at least
> that was the way it was last time this conversation went around and
> I spent a fair bit of time reading up on the pro/con unicode arguments.
> It's exhausting, but not really exhaustive...

It's now pretty much exhaustive. They even have room in the set for
things like Klingon (although fictional languages are not officially
sanctioned, there is a segment that is essentially unassigned -- the
Private Use Area -- that has been carefully divided up by fans of
fictional languages).

> 2^64 gives us all permutations of an 8x8 array of pixels. Let's just
> declare 64 bits the new wordsize, and all get modern at the same time.
> 64-bit addressable machines? anyone?  Let's just be fair about it.

Then you're going to be making life difficult for those who work with
16x16 arrays. ;-)

> Don't get me wrong, I see that it's a hard problem.  I just don't think
> the *approach* (let's make every glyph a character and simply assign a
> number to each) is the right one.

Replace "number" with "byte sequence" and that's exactly the solution
you are prescribing. The only difference I see here is you're insisting
on one encoding and one set of glyphs, while Unicode only insists on one
set of glyphs.
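The split is easy to see in Python: Unicode assigns the number (the code point), and each encoding maps that same number to its own byte sequence:

```python
ch = "\u4e2d"  # a CJK ideograph, code point U+4E2D

# One abstract number...
print(hex(ord(ch)))                  # 0x4e2d

# ...several interchangeable byte-sequence encodings of it.
print(ch.encode("utf-8").hex())      # e4b8ad
print(ch.encode("utf-16-be").hex())  # 4e2d
print(ch.encode("utf-32-be").hex())  # 00004e2d
```

Disliking one of those byte layouts is an argument about encodings, not about the code point assignments.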

>>>And they're suprised when I feel I'm being force into such things
>>>because they don't acknowledge a negative impact on me?
>>
>>There is a difference between a system that screws everyone a little bit 
>>and a system that primarily screws one group of people. It is possible 
>>(though often difficult) to get people to feel okay about the former.
> 
> However, if you cause pain to everyone so a subset can get a benefit
> that the rest do without... chances are, you're going to cause
> resentment.
> 
> We spell out words. Why shouldn't everyone else?

First of all, almost everyone does spell their words. It's a mistake to
think that languages without an alphabet don't still require multiple
symbols to form some of their words.

> Or if others don't have to spell their words, why should we?

Hey, and there is the crux of the matter. Why try to build a tower of
Babel? Let's work with the way language actually works, rather than
trying to shoehorn every language into working the way language works
in the US.

--Chris

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg