On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
> 25-May-2013 10:44, Joakim wrote:
>> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know.

> UCS is dead and gone. Next in line to "640K is enough for everyone".

I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.

>>> Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without a header the text is meaningless (no easy slicing) but that the encoding of the data after the header strongly depends on a variety of factors - a list of encodings, actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on the basis that there are no combining marks and no region-specific stuff :)

>> Everybody is still keeping code pages, UTF-8 hasn't changed that.

> Legacy. Hard to switch overnight. There are graphs that indicate that a few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.
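
For concreteness on the ambiguity described above: the same byte names different characters under different code pages, so raw bytes are meaningless without out-of-band information. A tiny C illustration (the two byte values are from ISO-8859-1 and Windows-1251):

#include <stdio.h>

int main(void)
{
    /* The same byte means different characters depending on which
       code page the reader assumes: */
    unsigned char b = 0xE9;
    printf("byte 0x%02X as ISO-8859-1:   U+00E9 (e with acute)\n", b);
    printf("byte 0x%02X as Windows-1251: U+0439 (Cyrillic short i)\n", b);
    /* Without a header or some other out-of-band signal, there is
       no way to tell which reading was intended. */
    return 0;
}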

>> Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?"

> It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff.

?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)
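
For readers following along, this is what "coherent in its scheme" cashes out to: the lead byte of a UTF-8 sequence encodes the sequence's length on its own, so no side table or header is needed to step through a string. A minimal C sketch:

#include <stdio.h>

/* Length of a UTF-8 sequence, read off the lead byte alone;
   returns 0 for a byte that cannot start a sequence. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1; /* 0xxxxxxx: ASCII           */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
    return 0;                         /* continuation byte: not a start */
}

int main(void)
{
    const unsigned char s[] = "a\xC3\xA9\xE4\xB8\xAD"; /* "a", "e-acute", CJK char */
    for (size_t i = 0; s[i] != 0; ) {
        int n = utf8_seq_len(s[i]);
        if (n == 0) break;            /* malformed input: stop */
        printf("%d-byte sequence at offset %zu\n", n, i);
        i += (size_t)n;
    }
    return 0;
}

Resynchronization works the same way: from any byte you can find the next character start without any external state.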

>> It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong, that was what I read on the newsgroup sometime back.

> Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it does return you a slice of the remainder of the original.

Perhaps substring search doesn't strictly require decoding, but you have changed the subject: slicing does require decoding, and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
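
For what it's worth, the reason substring search gets away without decoding is that lead bytes and continuation bytes (10xxxxxx) occupy disjoint ranges, so a valid UTF-8 needle can only match at a character boundary. A small C sketch using plain byte search:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw UTF-8 bytes throughout; no decoding anywhere.  Because a
       continuation byte can never be mistaken for a lead byte,
       strstr cannot produce a match that starts mid-character. */
    const char *haystack = "caf\xC3\xA9 au lait"; /* "cafe au lait" with e-acute */
    const char *needle   = "\xC3\xA9 au";         /* "e-acute, space, au" */
    const char *hit = strstr(haystack, needle);

    if (hit != NULL)
        printf("match at byte offset %td\n", hit - haystack);
    return 0;
}

Slicing to the i-th character, by contrast, still takes a linear scan, which is the distinction drawn above.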

> ??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string?

I sketched two possible encodings above, neither of which would require "cross-encodings."

>>> We want monoculture! That is, to understand each other without all these "par-le-vu-france?" and code pages of various complexity (insanity).

>> I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)

> So you never had trouble with internationalization? What languages do you use (read/speak/etc.)?

This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text.

>> That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.

> UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :)

> This is it, but it's far more flexible in the sense that it allows multi-lingual strings just fine, and lone full-width Unicode code points as well.

That's only because it uses a more complex header than a single byte for the language, which I noted could be done with my scheme, by adding a more complex header, long before you mentioned this Unicode compression scheme.
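
To make the "more complex header" concrete, here is one hypothetical layout - my invention purely for illustration, not a standard or an actual proposal: the string is a list of segments, each tagged with a one-byte page id and a length:

#include <stddef.h>

/* Hypothetical multi-language single-byte string: a list of segments,
   each carrying a one-byte page id and a run of one-byte characters.
   The names and layout here are invented for illustration. */
struct mls_segment {
    unsigned char page;        /* which 256-character set the bytes use */
    size_t len;                /* number of characters in this run      */
    const unsigned char *text; /* len bytes, one byte per character     */
};

struct mls_string {
    size_t nsegs;              /* e.g. an English run, then a Greek run */
    struct mls_segment *segs;
};

/* Character count is a sum over headers; no per-byte decoding. */
static size_t mls_length(const struct mls_string *s)
{
    size_t total = 0;
    for (size_t i = 0; i < s->nsegs; i++)
        total += s->segs[i].len;
    return total;
}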

>> But I get the impression that it's only for sending over the wire, ie transmission, so all the processing issues that UTF-8 introduces would still be there.

> Use mime-type etc. Standards are always a bit stringy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).

You misunderstand. I was saying that this Unicode compression scheme doesn't help you with string processing; it is only for transmission, and it is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.

> Consider adding another encoding for "Tuva", for instance. Now you have to add 2*n conversion routines to match it to other codepages/locales.

Not sure what you're referring to here.

> Beyond that, there are many things to consider in internationalization, and you would have to special-case them all by codepage.

Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above: toUpper is a NOP for a single-byte-encoded string in an Asian script; you can't do that with a UTF-8 string.
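
A sketch of that point, reusing the hypothetical header-tagged layout from above (all names invented for illustration): toUpper returns after one header test for a caseless script, whereas UTF-8 must decode its way through the whole string:

#include <stddef.h>

enum page { PAGE_LATIN, PAGE_CHINESE /* ... */ };  /* invented ids */

struct sb_string {
    unsigned char page;   /* header: which character set the bytes use */
    unsigned char *bytes; /* one byte per character                    */
    size_t len;
};

void sb_toupper(struct sb_string *s)
{
    if (s->page == PAGE_CHINESE)
        return;                        /* no case in the script: a NOP */
    for (size_t i = 0; i < s->len; i++)
        if (s->bytes[i] >= 'a' && s->bytes[i] <= 'z')
            s->bytes[i] -= 'a' - 'A';  /* Latin page: no decoding step */
}

A UTF-8 toUpper has no such early out; it has to decode every sequence just to discover that the string contains nothing to change.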

>> If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?

> UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and is compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?

The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.
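
For reference on the claim above: the mapping Dmitry describes really is small. A minimal C encoder sketch (my own illustration, not anyone's library code), covering the whole [0..10FFFF] range:

#include <stdio.h>

/* Encode one code point in [0, 0x10FFFF] (surrogates excluded) as
   UTF-8.  Returns the number of bytes written to out, 0 on bad input. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: ASCII           */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes                 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes                 */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not text */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes                 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* out of range            */
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);      /* U+20AC, the euro sign   */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);           /* prints: E2 82 AC        */
    printf("\n");
    return 0;
}
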
On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
> 25-May-2013 13:05, Joakim wrote:
>> Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.

> Problem is, what you outline is isomorphic with code pages. Hence the grief of accumulated experience against them.

They may seem superficially similar, but they're not. For example, from the beginning I have suggested a more complex header that can enable multi-language strings as one possible solution. I don't think code pages provided that.

> Well, if somebody got a quest to redefine UTF-8, they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life-saver anyway.

Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.
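
To make the efficiency claim concrete: indexing the i-th character is O(1) under a constant-width encoding and an O(n) scan under UTF-8. A small C sketch (function names invented for illustration):

#include <stddef.h>

/* One-byte-per-char string (the constant-width case): O(1) indexing. */
static unsigned char sb_char_at(const unsigned char *s, size_t i)
{
    return s[i];                           /* plain array access      */
}

/* UTF-8: finding the i-th character is an O(n) scan, counting lead
   bytes and skipping continuation bytes (10xxxxxx). */
static const unsigned char *utf8_char_at(const unsigned char *s, size_t i)
{
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80) {         /* a character starts here */
            if (i == 0)
                return s;
            i--;
        }
    }
    return NULL;                           /* fewer than i+1 chars    */
}

The fixed-width lookup costs the same no matter how long the string is; the UTF-8 scan has to touch every byte before the target.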

>> The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas - XML anyone? - often proliferate until someone comes up with something better to show how dumb they are.

> Even children know XML is awful, redundant shit as an interchange format. The hierarchical document is a nice idea, anyway.

_We_ both know that, but many others don't, or XML wouldn't be as popular as it is. ;) I'm making a similar point about the more limited success of UTF-8, ie it's still shit.