On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
> 25-May-2013 10:44, Joakim wrote:
>> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know.

> UCS is dead and gone. Next in line to "640K is enough for everyone".

I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.

>>> Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without a header the text is meaningless (no easy slicing) but that the encoding of the data after the header strongly depends on a variety of factors - a list of encodings, actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on the basis that there are no combining marks and no region-specific stuff :)

>> Everybody is still keeping code pages, UTF-8 hasn't changed that.

> Legacy. Hard to switch overnight. There are graphs that indicate that a few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.
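
For concreteness on the ambiguity described above: the same byte names different characters under different code pages, so raw bytes are meaningless without out-of-band information. A tiny C illustration (the two byte values are from ISO-8859-1 and Windows-1251):

#include <stdio.h>

int main(void)
{
    /* The same byte means different characters depending on which
       code page the reader assumes: */
    unsigned char b = 0xE9;
    printf("byte 0x%02X as ISO-8859-1:   U+00E9 (e with acute)\n", b);
    printf("byte 0x%02X as Windows-1251: U+0439 (Cyrillic short i)\n", b);
    /* Without a header or some other out-of-band signal, there is
       no way to tell which reading was intended. */
    return 0;
}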

>> Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?"

> It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff.

?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)
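
For readers following along, this is what "coherent in its scheme" cashes out to: the lead byte of a UTF-8 sequence encodes the sequence's length on its own, so no side table or header is needed to step through a string. A minimal C sketch:

#include <stdio.h>

/* Length of a UTF-8 sequence, read off the lead byte alone;
   returns 0 for a byte that cannot start a sequence. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1; /* 0xxxxxxx: ASCII           */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
    return 0;                         /* continuation byte: not a start */
}

int main(void)
{
    const unsigned char s[] = "a\xC3\xA9\xE4\xB8\xAD"; /* "a", "e-acute", CJK char */
    for (size_t i = 0; s[i] != 0; ) {
        int n = utf8_seq_len(s[i]);
        if (n == 0) break;            /* malformed input: stop */
        printf("%d-byte sequence at offset %zu\n", n, i);
        i += (size_t)n;
    }
    return 0;
}

Resynchronization works the same way: from any byte you can find the next character start without any external state.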

>> It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong, that was what I read on the newsgroup sometime back.

> Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it does return you a slice of the remainder of the original.

Perhaps substring search doesn't strictly require decoding, but you have changed the subject: slicing does require decoding, and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
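
For what it's worth, the reason substring search gets away without decoding is that lead bytes and continuation bytes (10xxxxxx) occupy disjoint ranges, so a valid UTF-8 needle can only match at a character boundary. A small C sketch using plain byte search:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw UTF-8 bytes throughout; no decoding anywhere.  Because a
       continuation byte can never be mistaken for a lead byte,
       strstr cannot produce a match that starts mid-character. */
    const char *haystack = "caf\xC3\xA9 au lait"; /* "cafe au lait" with e-acute */
    const char *needle   = "\xC3\xA9 au";         /* "e-acute, space, au" */
    const char *hit = strstr(haystack, needle);

    if (hit != NULL)
        printf("match at byte offset %td\n", hit - haystack);
    return 0;
}

Slicing to the i-th character, by contrast, still takes a linear scan, which is the distinction drawn above.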

> ??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string?

I sketched two possible encodings above, neither of which would require "cross-encodings."

>>> We want monoculture! That is, to understand each other without all these "par-le-vu-france?" and code pages of various complexity (insanity).

>> I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)

> So you never had trouble with internationalization? What languages do you use (read/speak/etc.)?

This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text.

>> That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.

> UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :)

> This is it, but it's far more flexible in the sense that it allows multi-lingual strings just fine, and lone full-width Unicode code points as well.

That's only because it uses a more complex header than a single byte for the language, which I noted could be done with my scheme, by adding a more complex header, long before you mentioned this Unicode compression scheme.
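
To make the "more complex header" concrete, here is one hypothetical layout - my invention purely for illustration, not a standard or an actual proposal: the string is a list of segments, each tagged with a one-byte page id and a length:

#include <stddef.h>

/* Hypothetical multi-language single-byte string: a list of segments,
   each carrying a one-byte page id and a run of one-byte characters.
   The names and layout here are invented for illustration. */
struct mls_segment {
    unsigned char page;        /* which 256-character set the bytes use */
    size_t len;                /* number of characters in this run      */
    const unsigned char *text; /* len bytes, one byte per character     */
};

struct mls_string {
    size_t nsegs;              /* e.g. an English run, then a Greek run */
    struct mls_segment *segs;
};

/* Character count is a sum over headers; no per-byte decoding. */
static size_t mls_length(const struct mls_string *s)
{
    size_t total = 0;
    for (size_t i = 0; i < s->nsegs; i++)
        total += s->segs[i].len;
    return total;
}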

>> But I get the impression that it's only for sending over the wire, ie transmission, so all the processing issues that UTF-8 introduces would still be there.

> Use mime-type etc. Standards are always a bit stringy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).

You misunderstand. I was saying that this Unicode compression scheme doesn't help you with string processing; it is only for transmission, and it is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.

> Consider adding another encoding for "Tuva", for instance. Now you have to add 2*n conversion routines to match it to other codepages/locales.

Not sure what you're referring to here.

> Beyond that, there are many things to consider in internationalization, and you would have to special-case them all by codepage.

Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above: toUpper is a NOP for a single-byte-encoded string in an Asian script; you can't do that with a UTF-8 string.
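
A sketch of that point, reusing the hypothetical header-tagged layout from above (all names invented for illustration): toUpper returns after one header test for a caseless script, whereas UTF-8 must decode its way through the whole string:

#include <stddef.h>

enum page { PAGE_LATIN, PAGE_CHINESE /* ... */ };  /* invented ids */

struct sb_string {
    unsigned char page;   /* header: which character set the bytes use */
    unsigned char *bytes; /* one byte per character                    */
    size_t len;
};

void sb_toupper(struct sb_string *s)
{
    if (s->page == PAGE_CHINESE)
        return;                        /* no case in the script: a NOP */
    for (size_t i = 0; i < s->len; i++)
        if (s->bytes[i] >= 'a' && s->bytes[i] <= 'z')
            s->bytes[i] -= 'a' - 'A';  /* Latin page: no decoding step */
}

A UTF-8 toUpper has no such early out; it has to decode every sequence just to discover that the string contains nothing to change.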

>> If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?

> UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and is compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?

The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.
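
For reference on the claim above: the mapping Dmitry describes really is small. A minimal C encoder sketch (my own illustration, not anyone's library code), covering the whole [0..10FFFF] range:

#include <stdio.h>

/* Encode one code point in [0, 0x10FFFF] (surrogates excluded) as
   UTF-8.  Returns the number of bytes written to out, 0 on bad input. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: ASCII           */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes                 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes                 */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not text */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes                 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* out of range            */
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);      /* U+20AC, the euro sign   */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);           /* prints: E2 82 AC        */
    printf("\n");
    return 0;
}
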
On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
> 25-May-2013 13:05, Joakim wrote:
>> Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.

> Problem is, what you outline is isomorphic with code pages. Hence the grief of accumulated experience against them.

They may seem superficially similar, but they're not. For example, from the beginning I have suggested a more complex header that can enable multi-language strings as one possible solution. I don't think code pages provided that.

> Well, if somebody got a quest to redefine UTF-8, they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life-saver anyway.

Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.
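
To make the efficiency claim concrete: indexing the i-th character is O(1) under a constant-width encoding and an O(n) scan under UTF-8. A small C sketch (function names invented for illustration):

#include <stddef.h>

/* One-byte-per-char string (the constant-width case): O(1) indexing. */
static unsigned char sb_char_at(const unsigned char *s, size_t i)
{
    return s[i];                           /* plain array access      */
}

/* UTF-8: finding the i-th character is an O(n) scan, counting lead
   bytes and skipping continuation bytes (10xxxxxx). */
static const unsigned char *utf8_char_at(const unsigned char *s, size_t i)
{
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80) {         /* a character starts here */
            if (i == 0)
                return s;
            i--;
        }
    }
    return NULL;                           /* fewer than i+1 chars    */
}

The fixed-width lookup costs the same no matter how long the string is; the UTF-8 scan has to touch every byte before the target.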

>> The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas - XML anyone? - often proliferate until someone comes up with something better to show how dumb they are.

> Even children know XML is awful, redundant shit as an interchange format. The hierarchical document is a nice idea, anyway.

_We_ both know that, but many others don't, or XML wouldn't be as popular as it is. ;) I'm making a similar point about the more limited success of UTF-8, ie it's still shit.