Re: Unicode and Friends (Was: JSON)

Christopher Smith Wed, 26 Oct 2005 01:25:49 -0700

Stewart Stremler wrote:

begin  quoting Andrew Lentvorski as of Tue, Oct 25, 2005 at 05:26:28PM -0700:

Stewart Stremler wrote:

I think the problem is that unicode tried to solve the wrong problem.
The real problem wasn't "how do we let everyone have single-character
glyphs", but "how do we let people write in their own language on a
computer".  Since we're ready to accept bloat at the outset, a better
approach (to my way of thinking) would be to toss out ANSI, stick with
ASCII, and redefine those ANSI characters as indicators for variable
length strings that should constitute a glyph.

That pretty much describes UTF-8. So, what is your particular beef with UTF-8?


UTF-8 doesn't use printable characters.

Well, neither does ASCII, hence why even in the ASCII days it was handy to have a shorthand for "printable characters" when building grammars and regexps. It turns out being able to support "non-printable" characters is handy.

Consequently, if I see a UTF-8 "sequence", I get a ? or an empty box,
and NO way to tell what's actually there without installing some sort
of appropriate font.  (Well, I can dump it to a file and use od...)

Actually, if your system had a font with full unicode coverage (such beasties do exist), there'd be no question mark. Furthermore, not all software displays a question mark. Some chose to render the unicode character value in octal (sometimes hex, although that is uncommon). Of course, this often turns out to be less helpful than the question mark. ;-)
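For what it's worth, you can find out what is actually in a sequence without any fonts installed. A minimal sketch in Python (my choice of language, nothing in this thread depends on it; `describe` is a made-up helper name) that prints only ASCII, using the standard `unicodedata` module:

```python
import unicodedata

def describe(text):
    """Return one ASCII-only line per character: hex code point plus Unicode name."""
    lines = []
    for ch in text:
        # Control characters have no name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append("U+%04X %s" % (ord(ch), name))
    return lines

# UTF-8 bytes as they might arrive from a file or a terminal (sample data).
sample = b"na\xc3\xafve".decode("utf-8")
for line in describe(sample):
    print(line)
```

The third line of output reads `U+00EF LATIN SMALL LETTER I WITH DIAERESIS`, which is rather more useful than a ? or an empty box.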

UTF-8 tries to make sure that nothing not an ASCII character looks
like an ASCII character; I'm not entirely convinced that this is an important issue. Perhaps it is and I just haven't grok'd the need.

Yeah, that was actually a very deliberate decision, and if you think about it, it turns out to be very important if you are trying to make it so UTF-8 can be dropped in to software that is used to dealing with ASCII with minimal consequences. It also makes parsing latin-1 stuff far more efficient.
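To make that concrete, here is a rough sketch in Python. The path is invented sample data, and the UTF-16 comparison at the end is my own addition for contrast. It shows that byte-oriented code which only knows about ASCII delimiters keeps working on UTF-8 input:

```python
# Hypothetical sample path mixing ASCII, CJK, and accented Latin text.
path = "/home/stew/\u65e5\u672c\u8a9e/na\u00efve.txt"
raw = path.encode("utf-8")

# Every byte belonging to a non-ASCII character has its high bit set,
# so none of them can be mistaken for '/', a quote, NUL, and so on.
non_ascii_bytes = [b for ch in path if ord(ch) > 0x7F for b in ch.encode("utf-8")]
print(all(b >= 0x80 for b in non_ascii_bytes))  # True

# An ASCII-only routine (split on the byte 0x2F) therefore still works.
parts = [p.decode("utf-8") for p in raw.split(b"/")]
print(parts == path.split("/"))  # True

# Contrast: in UTF-16 the byte 0x2F can show up inside another character.
kangxi_one = "\u2f00"  # U+2F00 KANGXI RADICAL ONE
print(b"/" in kangxi_one.encode("utf-16-le"))  # True
print(b"/" in kangxi_one.encode("utf-8"))      # False
```

A C program that scans for '/' or '\0' one byte at a time gets the same guarantee, which is why so much ASCII-era software handles UTF-8 without modification.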

Heh. My issue *is* Unicode. I believe that Unicode was a solution that was arrived at early and all the brainpower was put into making it work instead of asking "is this the right thing to do?" This is often the case with smart people, I find... they *can* make it work, so they don't
stop to think about whether it's worth it.
I disagree. Completely. Unicode means that I can just have a single "String" abstraction that works across multiple human and computer languages.
UTF8 does give you that. UTF-16 (or is it UCS-16?) doesn't.

So it's not the string abstraction that's the problem, it's the encoding
of glyphs.  Wide-characters seem to be the most common implementation,
and they *suck*.

Wide characters are not the most common implementation. UTF-8 is.

While UTF-16 (or UCS-2 btw) has a different set of advantages and disadvantages than UTF-8 (as does UTF-32), I can't see how it impacts on your ability to have a "String" abstraction that works across multiple human and computer languages.
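A quick illustration of what I mean, as a sketch in Python with made-up sample text: the string is one abstraction, and the encoding only matters at the boundary where it turns into bytes.

```python
# Latin, CJK, and a character outside the BMP (U+1D11E MUSICAL SYMBOL G CLEF).
text = "caf\u00e9 \u65e5\u672c\u8a9e \U0001d11e"

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = text.encode(codec)
    # Each encoding round-trips to the identical string.
    assert raw.decode(codec) == text
    sizes[codec] = len(raw)

print(len(text), "code points")  # 10 code points
print(sizes)  # {'utf-8': 20, 'utf-16-le': 22, 'utf-32-le': 40}
```

The trade-offs are in size and in how awkward indexing gets (UTF-16 needs a surrogate pair for that last character), not in whether the abstraction holds.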

I don't disagree that everyone's glyphs should be represented. But
Unicode even compromised on that.  We have *simplified* collections of
glyphs.

Actually, they have the simplified collections and more extended collections as well. In many cases Unicode handles a broader set of glyphs than other encoding formats.

And Unicode introduces *another* problem -- the problem of too-similar
glyphs *explodes*.  This is a security issue -- a boon to phishers all
over the world.  If I can't set my locale (or toggle my display) so that
the extended character sequences show up as non-ambiguous character
sequences, I have a problem from the whole mess from the standpoint as
a user.

Wait, up above you were claiming that any extended character sequences are presented as question marks.... that would seem to really screw a phisher if you ask me. ;-)

That said, TLS/SSL certificates are *supposed* to be managed in such a way that getting a certificate for such a domain should be impossible, and ultimately, only the certificate can be trusted. Since people don't look at the certificate, they're already taking a huge risk and exposing themselves to phishing, unicode or no.
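As for the "toggle my display" wish, that part is cheap to do in software. A minimal sketch in Python, where the spoofed domain and the `unambiguous` helper are both invented for illustration:

```python
def unambiguous(s):
    """Render a string in pure ASCII, replacing anything else with a \\u escape."""
    return s.encode("ascii", "backslashreplace").decode("ascii")

latin = "paypal.com"
spoof = "p\u0430ypal.com"  # second letter is U+0430 CYRILLIC SMALL LETTER A

print(latin == spoof)      # False, even if the two render identically
print(unambiguous(latin))  # paypal.com
print(unambiguous(spoof))  # p\u0430ypal.com
```

Whether a given browser or mail client offers such a view is a separate question; the point is only that the look-alike is trivially distinguishable once you stop rendering glyphs.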


--Chris

--
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
