begin  quoting Christopher Smith as of Thu, Oct 27, 2005 at 11:54:10AM -0700:
> > Sure. "Do it my way and hang everyone else" solves a lot of problems. If
> > we're going to have that, I'd just as soon the way be my way.
> 
> That approach tends to be particularly offensive. Standards like Unicode
> allow one to say, "hey, I'm not imposing my perspective on you, I'm
> going with a standard." If you don't have a sense as to why this is
> important, you've never used software written in Japan by a programmer
> employing a similar attitude. ;-)

ASCII is a standard, too.

Just because you're sticking with a standard doesn't mean you've
solved a problem.

[snip]
> > Exactly. When you have similar glyphs, the computer *still* can't tell,
> > and the user can.   Consider those fonts where l == 1 -- when the user
> > sees schoo1, he reads 'school', but the computer says "nope, different".
> 
> That is a problem with the font, not with the character set. Indeed,

It's a problem with an interaction between the font and the character
set.

> it's a problem often encountered with ASCII. Perhaps the correct
> solution is to apply your principles to English characters too. Nobody
> would confuse '&el' with '&one'. ;-)

So... NOW you're willing to take the space tradeoff? :)
 
> > If you're looking for standardized spelling, immensely increasing the
> > number of symbols and including symbols that closely resemble other
> > symbols is *not* the way to solve the problem.
> 
> It's way the heck better than having a small set of symbols which often
> don't have any direct correspondence with the correct set of symbols.

I emphatically disagree.

When we get right down to it, we're using 1s and 0s. We chain 'em
together to make minimum-size entities, and build everything up from
there.

> > Any solution that introduces a flaw that causes the original problem
> > to re-emerge isn't a very good solution.
> 
> I think you're seeing a different flaw than what I'm talking about.

No doubt. I'm talking security and risk, not convenience and efficient
representation.

[snip]
> > And now you're changing the data behind the user's back. BAD PROGRAMMER!
> > What if they  *meant* "schoo1" ?
> 
> Huh? Those aren't the kinds of transformations I'm talking about, and
> you aren't changing the data, you use the transformations to create a
> unique key. Don't think "schoo1" vs. "school". I'm talking about
> canonical equivalence. Think in the context of collation.

{ all ways to spell $term } -> { $term }

You're changing the data *somewhere*.
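To make the distinction concrete, here's a minimal Python sketch of the kind of canonical-equivalence mapping under discussion (my illustration, using the standard `unicodedata` module and NFC form): the stored data are left alone, and normalization is only used to derive a comparison key.

```python
import unicodedata

# Two ways to "spell" the same word: precomposed é versus e + combining acute.
a = "caf\u00e9"    # 'café' with U+00E9, a single precomposed code point
b = "cafe\u0301"   # 'café' as 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(a == b)      # False: the raw code-point sequences differ

# Derive a canonical key for comparison/collation; the originals are untouched.
key_a = unicodedata.normalize("NFC", a)
key_b = unicodedata.normalize("NFC", b)
print(key_a == key_b)  # True: both spellings map to one canonical key
```

Whether deriving the key counts as "changing the data somewhere" is, of course, exactly the point being argued.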

[snip]
> >>No, the longer the words get, the easier life is for phishers.
> > 
> > The risk isn't that casual words that can be inferred from context,
> > but rather that URLs that a user is instructed to go to can't be 
> > checked.
> 
> More accurately, *won't* be checked. No such thing as a URL that *can't*
> be checked. The issue is one of making them look visually similar enough
> that someone just goes ahead and clicks on it without thinking. It turns

The problem isn't "without thinking", it's _with_ thinking.

A reasonably careful user can be fooled. Sure, if they were to cut
and paste the url into a file and then run od or hexdump, they could
tell the difference, but that's too much to expect a user to do.
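The od/hexdump check amounts to a few lines of script. A sketch, with made-up hostnames, of two strings that can render identically yet differ at the code-point level:

```python
# Hypothetical hostnames: the second swaps ASCII 'a' for
# CYRILLIC SMALL LETTER A (U+0430), which many fonts draw identically.
real = "paypal.example"
fake = "p\u0430yp\u0430l.example"

print(real == fake)  # False, even though a careful reader may see no difference

# A poor man's od: dump each character's code point.
for name, s in (("real", real), ("fake", fake)):
    print(name, [f"U+{ord(c):04X}" for c in s[:6]])
```

The point stands: nothing short of inspecting the code points reveals the difference, and that's too much to expect of a user.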

> out that really long words with single character differences are at
> least as "similar looking" as really short words with a similar looking
> character. 

Cite?

>            I seem to recall a rather powerful demonstration of this
> where the order of all the characters except the first and last
> character in each word was scrambled, and the resultant text was amazingly readable.
> Sure, you could easily tell they weren't spelled right, but imagine now
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

THAT IS THE CHECK.

> if only two of the letters were interchanged. better still if it's
> displayed in a sans serif font that you aren't used to.

With unicode, *most* of the possible symbols will be ones I'm not used to.

[snip]
> > Yes. Words are variable-length multi-byte character sequences. They
> > are not efficient representations.
> > 
> > Why do China and Taiwan and Japan need efficient representations
> > for words?
> 
> You think China's approach is really more efficient than English?

For a human? No. Consequently, I don't see why there's all this effort
to make it efficient for a computer.  Presumably one could break down
ideograms into strokes, and then combine strokes into words... of course,
this would be a far harder problem than merely assigning a number to
each ideogram.

I think that Chinese/Japanese/etc. make for very *pretty* writing. It's
efficient for generating art -- just write something. :)

  _|/
  /J\
 
[snip]
> > Yes. So why should someone else's words map into the character space? A
> > character is a *component* of a word.
> 
> Characters are the atoms of written language.  You can try to make a case
> for things being broken down by brush or pencil strokes, but
> fundamentally humanity seems to like to look at things on a character by
> character basis.

The atoms of communication, however, are *words*.

>                  People have done what you're describing, and it turns
> out that it really, really doesn't work.

Works in TeX just fine. :)
 
[snip]
> I'm sorry, but you keep getting mad about unicode and then describe
> solutions that are so strikingly similar that it seems to me you ought
> to consider the possibility that there are some additional factors that
> you haven't considered that make the minor variations a good idea.

I disagree with some of their fundamental goals (a space-efficient
representation is needed for characters that fulfill the role of words);
not the problem (a standard is needed to represent all glyphs).  Of
course the solutions will be similar.

Just because it was a big committee of smart people who worked hard
doesn't mean that they've automatically got it right.  Microsoft employs
a lot of smart people who work hard, and yet I have chosen to not believe
that they "got it right".

Perhaps I should just trust that Congress always does the right thing 
in the best interests of everyone and never ever bother to think that
perhaps they ignored the sorts of things I would worry about because
they just don't care about those sorts of things?

[snip]
> > That is news!
> 
> I don't know about family names, but certainly they worked on the
> historical stuff. It's amazing how hard it is to search a collection of
> historical documents for historical phrases when you don't have
> historical characters. ;-)

Oh, good, that's nice they caved in there...

[snip]
> > It's exhausting, but not really exhaustive...
> 
> It's now pretty much exhaustive. They even have room in the set for
> things like Klingon (although fictional languages are not officially
> sanctioned, there is a segment that is essentially unlicensed that has
> been carefully divided up by those who are fans of fictional language).
 
Heh. Elvish presumably made it in as well, one hopes.

> > 2^64 gives us all permutations of an 8x8 array of pixels. Let's just
> > declare 64 bits the new wordsize, and all get modern at the same time.
> > 64-bit addressable machines? anyone?  Let's just be fair about it.
> 
> Then you're going to be making life difficult for those who work with
> 16x16 arrays. ;-)

And if you give in to THOSE folks, well, then you have to explain why
you're discriminating against those who want 24x24 pixel arrays.
 
[snip]
> Replace "number" with "byte sequence" and that's exactly the solution
> you are prescribing.

Not quite!  If that's what you think I'm prescribing, no wonder we're
talking past each other.

>                      The only difference I see here is you're insisting
> on one encoding and one set of glyphs, while Unicode only insists on one
> set of glyphs.
 
Not quite.  I would rather the problem had been tackled from the
other direction... start with an ASCII-based encoding, extend from
there, and build up the glyph-set, rather than create the glyph-set
and work out the encoding.
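For what it's worth, at the encoding layer UTF-8 was built in roughly that direction: every ASCII byte is valid UTF-8 with the same meaning, and non-ASCII code points extend past it as multi-byte sequences. A quick Python check:

```python
text = "school"
# ASCII text encodes to the identical bytes under UTF-8.
print(text.encode("ascii") == text.encode("utf-8"))  # True

# Non-ASCII characters extend beyond the ASCII range with multi-byte sequences:
# U+0131 (dotless i) becomes the two bytes C4 B1.
print("schoo\u0131".encode("utf-8"))
```

The glyph-set, of course, was still designed first and the encoding fitted afterward, which is the order being objected to here.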

[snip]
> > We spell out words. Why shouldn't everyone else?
> 
> First of all, almost everyone does spell their words. It's a mistake to
> think that languages without an alphabet don't still require multiple
> symbols to form some of their words.

Some, yes. We do that in English, too, with compound words.  Given
that we could have a glyph for "house" and another for "boat", we
wouldn't need glyphs for "houseboat" or "boathouse".

Given 30,000 glyphs and 100,000 words, of course you'll have to "spell"
words.  That doesn't mean that the base you've chosen isn't excessive.

> > Or if others don't have to spell their words, why should we?
> 
> Hey, and there is the crux of the matter. Why try to build a tower of
> babel.

I suspect you're abusing the metaphor. The tower of babel was an attempt
to reach heaven, and *resulted* in the fragmentation of language.

>        Let's work with the way language actually works, rather than
> trying to shoehorn languages to all work the way language works in the US.

The way language actually works is that one group goes and beats up
another group and makes 'em do it their way instead.

Sounds like this has happened. I don't see why everyone expects me to be
ecstatic that I'm on the losing side.

-Stewart

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
