On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim writes:
Yes on the encoding, if it's a variable-length encoding like UTF-8; no on the code space. I was originally going to title my post "Why Unicode?", but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS too; I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for everyone".
I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.

Separate code spaces were the case before Unicode (and UTF-8). The problem is not only that text is meaningless without its header (no easy slicing), but that the encoding of the data after the header depends on a variety of factors: a list of encodings, actually. Now everybody has to keep a (code) page per language just to know whether it's 2 bytes per char, 1 byte per char, or whatever. And that still assumes there are no combining marks or region-specific stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs indicating that a few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?"

It's coherent in its scheme to determine that. You don't need extra information synced to the text, unlike header-based schemes.
?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)
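[For context on this exchange: the "scheme" in question is that a UTF-8 lead byte encodes its own sequence length in its high bits, so no out-of-band header or code-page table is needed. A minimal sketch in Python, chosen purely for illustration since the thread contains no code of its own:]

```python
def utf8_seq_len(lead: int) -> int:
    """Return the byte length of a UTF-8 sequence, given only its lead byte."""
    if lead < 0x80:
        return 1                  # 0xxxxxxx: plain ASCII
    if lead < 0xC0:
        raise ValueError("continuation byte, not a lead byte")
    if lead < 0xE0:
        return 2                  # 110xxxxx
    if lead < 0xF0:
        return 3                  # 1110xxxx
    if lead < 0xF8:
        return 4                  # 11110xxx
    raise ValueError("invalid lead byte")

# Each character carries its own length; no header needed.
for ch in ("A", "é", "中", "😀"):
    encoded = ch.encode("utf-8")
    assert utf8_seq_len(encoded[0]) == len(encoded)
```

This is what "coherent in its scheme" means in practice: the length information travels with every character, instead of living in a separate header that must stay synced with the text.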

It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong; that was what I read on the newsgroup some time back.

Indeed you are: searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it returns a slice into the original.
Perhaps substring search doesn't strictly require decoding, but you have changed the subject: slicing does require decoding, and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
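[Both sides of this point can be shown concretely. Byte-level substring search does work on UTF-8 without decoding, because one character's encoding never appears inside another's; but slicing at an arbitrary byte offset can still split a code point. A Python sketch, illustrative only:]

```python
s = "naïve café".encode("utf-8")

# Byte-level search: no decoding, and the match offset is a valid slice point,
# since UTF-8 guarantees a character's bytes never occur inside another's.
i = s.find("café".encode("utf-8"))
assert s[i:].decode("utf-8") == "café"

# But slicing at an arbitrary *byte* index can cut a multi-byte character:
try:
    s[:3].decode("utf-8")      # splits the 2-byte "ï" in half
except UnicodeDecodeError:
    pass                       # decoding fails mid-character
```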

??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string?
I sketched two possible encodings above, neither of which would require "cross-encodings."

We want monoculture! That is, to understand each other without all this "parlez-vous français?" and code pages of various complexity (insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)

So you've never had trouble with internationalization? What languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text.

That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :)

This is it, but it's far more flexible in the sense that it allows multilingual strings just fine, and lone full-width Unicode code points as well.
That's only because it uses a more complex header than a single byte for the language. I noted that could be done with my scheme, by adding a more complex header, long before you mentioned this Unicode compression scheme.

But I get the impression that it's only for sending over the wire, i.e. transmission, so all the processing issues that UTF-8 introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).
You misunderstand. I was saying that this unicode compression scheme doesn't help you with string processing, it is only for transmission and is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason why we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.

Consider adding another encoding for "Tuva", for instance. Now you have to add 2*n conversion routines to match it to the other codepages/locales.
Not sure what you're referring to here.

Beyond that, there are many things to consider in internationalization, and you would have to special-case them all by codepage.
Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above. toUpper is a no-op for a single-byte-encoded string in an Asian script; you can't do that with a UTF-8 string.

If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?

UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and is compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.
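[For concreteness on the BOM point: plain UTF-8 is byte-identical to ASCII for ASCII text, but the optional UTF-8 BOM (EF BB BF) prepends non-ASCII bytes that naive ASCII tools will trip over. A quick Python illustration:]

```python
import codecs

plain = "hello".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain   # EF BB BF prefix, permitted by the standard

assert plain == b"hello"             # pure-ASCII text: identical to ASCII bytes
assert with_bom != b"hello"          # BOM-prefixed text is not
assert with_bom.decode("utf-8-sig") == "hello"   # a BOM-aware decoder strips it
```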

There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:
Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.

The problem is that what you outline is isomorphic to code pages. Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not. For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code pages provided that.
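[To make the comparison concrete, here is a purely hypothetical sketch of the kind of header-plus-single-byte scheme being proposed. The header byte, the table numbers, and the tables themselves are all invented for illustration; this is not any real standard, and the real proposal's multi-language header would be more involved:]

```python
# Hypothetical toy scheme: one header byte selects a 256-entry character table,
# then every character in the body is a single byte indexing into that table.
LATIN1 = {i: chr(i) for i in range(256)}            # table 0: Latin-1
GREEK  = {i: chr(0x370 + i) for i in range(0x80)}   # table 1: a made-up Greek page

TABLES = {0: LATIN1, 1: GREEK}

def decode(data: bytes) -> str:
    table = TABLES[data[0]]        # header byte picks the code table
    return "".join(table[b] for b in data[1:])

assert decode(bytes([0]) + b"hello") == "hello"
```

The isomorphism Dmitry alleges is visible here: each table is, structurally, a code page, and the header plays the role the out-of-band encoding declaration used to play.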

Well, if somebody got the task of redefining UTF-8, they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life-saver anyway.
Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.
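[The productivity claim is hard to quantify, but the mechanical difference the two sides are arguing about is real: indexing the Nth character of a constant-width encoding is O(1), while raw UTF-8 requires a linear scan over lead bytes. An illustrative Python sketch (Python's own `str` serves as the decoded, directly indexable form here):]

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """O(n) scan for the nth character: count lead bytes (not 0b10xxxxxx)."""
    count, start = -1, None
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:     # lead byte starts a new character
            count += 1
            if count == n:
                start = i
                break
    end = start + 1                # extend over this character's continuation bytes
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return data[start:end].decode("utf-8")

text = "a中b"
assert nth_char_utf8(text.encode("utf-8"), 1) == "中"   # needs a byte scan
assert text[1] == "中"   # decoded form: direct indexing, no scan
```

This is also why, as mentioned earlier in the thread, libraries often decode UTF-8 to a fixed-width form internally before doing character-level work.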

The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas (XML, anyone?) often proliferate until someone comes up with something better to show how dumb they are.

Even children know XML is awful, redundant shit as an interchange format. The hierarchical document is a nice idea, anyway.
_We_ both know that, but many others don't, or XML wouldn't be as popular as it is. ;) I'm making a similar point about the more limited success of UTF-8, ie it's still shit.
