begin quoting Christopher Smith as of Tue, Oct 25, 2005 at 05:27:38PM -0700:
> Stewart Stremler wrote:
[snip]
> > I don't think the problem is ASCII -- that's the sort of simple mapping
> > that's capable of being well-defined and standardized.
>
> You're right that the problem isn't ASCII. The problem is that there
> isn't really a canonical spelling of Tchaikovsky in ASCII, and
> standardizing on one is harder and more problematic than simply using
> the Cyrillic representation, particularly when you need to standardize
> on all the Tchaikovskys out there.
Even within ASCII, there's more than one way to spell Shakespeare. That
problem isn't really resolved by choosing a glyph-set. And if you have
glyphs that look similar in some font, the problem comes back... and so
allowing all glyphs wasn't really a solution anyway.

> > i18n and l10n are examples of that simple mapping *within* a language.
> > (It's not like "internationalization" is hard to spell or anything. Or
> > type, if you aren't hunting-and-pecking your way around the keyboard.)
>
> I'm not a hunt-and-pecker, and I make enough typeohs without spelling
> out internationalization all the time.

Presumably you have software that can help. :)

> More importantly, I can glance at
> i18n and recognize what it represents much more quickly than
> internationalization.

If you can do that, then remapping into ASCII should be a simple thing.

[snip]

> > Including the use of it. Is that microsoft.com with an oh, or some
> > other glyph that *looks* exactly like an oh?
> >
> > Take the confusion we have with fonts where 1 and l look the same --
> > that's one of the major issues of Unicode writ small.
>
> Yes, but those problems don't go away in a world with multiple
> character sets.

Indeed. But they do go away if there's a default representation in a
non-ambiguous character set.

[snip]

> > I think the problem is that unicode tried to solve the wrong problem.
> > Or perhaps, people looked for it as the solution to the wrong problem.

Hm...

> > The real problem wasn't "how do we let everyone have single-character
> > glyphs", but "how do we let people write in their own language on a
> > computer".
>
> Once you deal with the problem for a while, you discover that having a
> way to represent glyphs as distinct entities (which is what a character
> really is) is very much a needed capability in software, and not really
> separable from the problem of letting people write in their own
> language on a computer.
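As an aside, the 1-versus-l and o-versus-oh worry above is easy to make
concrete. A quick Python sketch (the spoofed domain string here is a
made-up example for illustration, not a real registered name):

```python
import unicodedata

# Two strings that render identically in many fonts but are
# different code point sequences.
latin = "microsoft.com"           # ordinary ASCII
spoofed = "micr\u043esoft.com"    # U+043E CYRILLIC SMALL LETTER O

# They look the same when printed...
print(latin, spoofed)

# ...but compare unequal, which is exactly the ambiguity problem.
print(latin == spoofed)           # False

# The Unicode character names make the difference visible.
print(unicodedata.name(latin[4]))    # LATIN SMALL LETTER O
print(unicodedata.name(spoofed[4]))  # CYRILLIC SMALL LETTER O
```

In a pure-ASCII default representation the second string simply can't be
written, which is the point about a non-ambiguous character set.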
I assert that (English) words can be considered glyphs (think cursive),
and therefore deserve the same sort of treatment.

> > Since we're ready to accept bloat at the outset, a better
> > approach (to my way of thinking) would be to toss out ANSI, stick
> > with ASCII, and redefine those ANSI characters as indicators for
> > variable-length strings that should constitute a glyph.
>
> That is pretty similar to what UTF-8 is. The problem is that that isn't
> the entire problem.

UTF-8 is *almost* what I want. :)

> > Old software still works -- and given the correct display smarts
> > (e.g. rewrite printf), works transparently.
>
> /me falls out of chair
>
> No, it breaks the first time it makes the assumption that a character
> or glyph is exactly one byte, or at the very least fixed width (sadly
> there is almost as much software that thinks fixed-width 16-bit
> characters are all you need as there is software that thinks
> fixed-width 8-bit characters are all you need). Think of how many C
> programs you've seen that look for a specific byte in a string
> somewhere, without considering the possibility that it might be part
> of a multiple-byte character. Indeed, a lot of old lexers suffer from
> this problem.

If I'm on an old system, I *want* that. For me, that's a FEATURE, not a
problem.

> > You could look at the raw data if you wanted, in an unambiguous
> > format that would still be readable if it were only a character here
> > or there. Everyone wins, except for those who will need five-or-more
> > character strings to represent a glyph.
>
> Yup. It turns out that basically most Asian countries hate UTF-8
> because it makes their characters bigger than local character sets.

And when we go to UCS-16 or UCS-32, we'll all hate *that*. Plus, they're
still dealing with simplified character sets, so what we obviously need
is UCS-64, right? (I haven't gone and looked up how big Unicode actually
gets...)
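For what it's worth, Unicode's code space tops out at U+10FFFF (about
1.1 million code points, 21 bits). And both complaints above -- the
one-byte-per-character assumption breaking, and CJK text growing under
UTF-8 -- can be shown in a few lines of Python:

```python
# A character outside ASCII occupies more than one byte in UTF-8.
s = "naïve"
raw = s.encode("utf-8")
print(len(s), len(raw))    # 5 characters, 6 bytes

# Slicing the byte string mid-character corrupts the text -- the same
# breakage byte-at-a-time C code hits when it treats bytes as characters.
broken = raw[:3]           # cuts the two-byte 'ï' in half
print(broken.decode("utf-8", errors="replace"))   # 'na' + replacement char

# The size complaint: a character that is 2 bytes in a local encoding
# like Shift-JIS becomes 3 bytes in UTF-8.
ch = "日"
print(len(ch.encode("shift_jis")))   # 2
print(len(ch.encode("utf-8")))       # 3
```

So for East Asian text UTF-8 really is a ~50% size penalty over the
local encodings, while ASCII stays exactly one byte per character.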
Presumably, we could stop when everyone on the planet gets their own
n-bit character space. That would be fair.

> > (But as those languages use a glyph-per-word, more or less, this
> > shouldn't be a problem -- nobody was demanding that a sizable subset
> > of the english dictionary be mapped into unicode space. Fair's fair.)
>
> Actually, not all of those languages use glyph-per-word, and the issue
> is that there is a more compact and efficient representation.

...so we can avoid bloat in our XML documents...

> People tend to feel slighted when they are forced into such things
> while you don't see much of a negative impact.

And they're surprised when I feel I'm being forced into such things
because they don't acknowledge a negative impact on me?

[snip]

> Unicode has evolved significantly from the early days, and as such it
> has done a reasonable job of addressing needs that have emerged since
> its inception. I find its problems are mostly of the "design by
> committee" nature and the tendency to see it as the solution to all
> things, and that's hard to avoid with standards.

True, true. Design by committee tends to aim at making everyone equally
unhappy.

-Stewart

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
