Stewart Stremler wrote:
> begin quoting Christopher Smith as of Tue, Oct 25, 2005 at 04:03:16PM -0700:
>>Gabriel Sechan wrote:
>>
>>>Most of the time, you're writing programs either for yourself or for
>>>your company. Internal apps (at least those I work on) rarely, if
>>>ever, go overseas.
>>
>>Yeah, that is becoming less true these days. Even apps that never go
>>overseas have to deal with companies and products from overseas, and
>>representing them in ASCII is often highly error-prone (as the
>>Tchaikovsky example indicates).
>
> I don't think the problem is ASCII -- that's the sort of simple
> mapping that's capable of being well-defined and standardized.
You're right that the problem isn't ASCII. The problem is that there
isn't really a canonical spelling of Tchaikovsky in ASCII, and
standardizing on one is harder and more problematic than simply using
the Cyrillic representation, particularly when you need to standardize
on every Tchaikovsky out there.

>>Basically, once you are dealing with an interchange format where you
>>are considering XML, the probabilities really start to skew towards
>>i18n issues (l10n issues might not show up unless you specifically
>>have non-local users).
>
> i18n and l10n are examples of that simple mapping *within* a language.
> (It's not like "internationalization" is hard to spell or anything. Or
> type, if you aren't hunting-and-pecking your way around the keyboard.)

I'm not a hunt-and-pecker, and I make enough typos without spelling out
"internationalization" all the time. More importantly, I can glance at
i18n and recognize what it represents much more quickly than
"internationalization".

>>>I speak only English. So worrying about internationalization is a
>>>waste of my time.
>>
>>I speak French and English, but I can assure you the stuff where I
>>need to deal with i18n is generally not French. ;-) Basically, if what
>>you are doing is tied to the Internet, there is a distinct risk of
>>i18n issues.
>
> Including the use of it. Is that microsoft.com with an oh, or some
> other glyph that *looks* exactly like an oh?
>
> Take the confusion we have with fonts where 1 and l look the same --
> that's one of the major issues of Unicode writ small.

Yes, but those problems don't go away in a world with multiple
character sets.

>>>If it turns out it needs to be added later, it's cheaper to do so
>>>then, on average.
>>
>>For the most part I agree with you. The problem is when it needs to be
>>there from the get-go and one fails to recognize this; then all hell
>>breaks loose.
>>Also, certain languages (C in particular) can make it much harder for
>>you to refactor to a different string representation without
>>introducing a lot of difficult-to-identify bugs. So, while going all
>>out on i18n might be a lot of wasted effort, making sure you have the
>>right abstractions so that you can handle it later is important.
>
> I think the problem is that unicode tried to solve the wrong problem.

Or perhaps people looked to it as the solution to the wrong problem.

> The real problem wasn't "how do we let everyone have single-character
> glyphs", but "how do we let people write in their own language on a
> computer".

Once you deal with the problem for a while, you discover that having a
way to represent glyphs as distinct entities (which is what a character
really is) is very much a needed capability in software, and not really
separable from the problem of letting people write in their own
language on a computer.

> Since we're ready to accept bloat at the outset, a better approach (to
> my way of thinking) would be to toss out ANSI, stick with ASCII, and
> redefine those ANSI characters as indicators for variable-length
> strings that should constitute a glyph.

That is pretty similar to what UTF-8 is. The problem is that the
encoding isn't the entire problem.

> Old software still works -- and given the correct display smarts (e.g.
> rewrite printf), works transparently.

/me falls out of chair

No, it breaks the first time it assumes that a character or glyph is
exactly one byte, or at the very least fixed-width (sadly, there is
almost as much software that thinks fixed-width 16-bit characters are
all you need as there is software that thinks fixed-width 8-bit
characters are all you need). Think of how many C programs you've seen
that look for a specific byte in a string somewhere, without
considering the possibility that it might be part of a multi-byte
character. Indeed, a lot of old lexers suffer from this problem.
> You could look at the raw data if you wanted, in an unambiguous
> format that would still be readable if it were only a character here
> or there. Everyone wins, except for those who will need five-or-more
> character strings to represent a glyph.

Yup. It turns out that basically most Asian countries hate UTF-8
because it makes their characters bigger than they are in their local
character sets.

> (But as those languages use a glyph-per-word, more or less, this
> shouldn't be a problem -- nobody was demanding that a sizable subset
> of the english dictionary be mapped into unicode space. Fair's fair.)

Actually, not all of those languages use glyph-per-word, and the issue
is that a more compact and efficient representation exists. People tend
to feel slighted when they are forced into such things while you see
little negative impact yourself.

> Heh. My issue *is* Unicode. I believe that Unicode was a solution that
> was arrived at early, and all the brainpower was put into making it
> work instead of asking "is this the right thing to do?" This is often
> the case with smart people, I find... they *can* make it work, so they
> don't stop to think about whether it's worth it.

Unicode has evolved significantly from the early days, and as such it
has done a reasonable job of addressing needs that have emerged since
its inception. I find its problems are mostly of the "design by
committee" nature, plus the tendency to see it as the solution to all
things -- and that's hard to avoid with standards.

--Chris
-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
