On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 having every non-English language string be variable-length. It may be more work to write library code to handle my encoding, but efficiency and ease of use are paramount.

So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement, I see no reason why you think decoding every UTF-8 character is faster.
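To make that engineering question concrete, here is a minimal sketch; it is entirely hypothetical, since the proposed encoding has no spec, and the function names are mine. A per-string width flag hoists the one-versus-two-byte conditional out of the character loop, while UTF-8 must branch on every character's lead byte.

```c
#include <stddef.h>

/* Hypothetical scheme from this thread: a header flag says whether the
 * whole string is one or two bytes per character, so the conditional
 * runs once per string rather than once per character. */
size_t count_chars_flagged(const unsigned char *s, size_t nbytes, int two_byte)
{
    (void)s;                          /* length alone determines the count */
    return two_byte ? nbytes / 2 : nbytes;
}

/* UTF-8: each character's width must be discovered from its lead byte,
 * so the branch executes for every character in the string. */
size_t count_chars_utf8(const unsigned char *s, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; n++) {
        unsigned char c = s[i];
        if (c < 0x80)                 i += 1;   /* ASCII */
        else if ((c & 0xE0) == 0xC0)  i += 2;   /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)  i += 3;   /* 3-byte sequence */
        else                          i += 4;   /* 4-byte sequence */
    }
    return n;
}
```

Whether the per-character branch actually costs more than a per-string header lookup is exactly the judgement call being argued; the sketch only shows where the conditionals live.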

then you claim I don't handle these languages. This kind of blatant contradiction within two posts can only be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path.

I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings - you can't even have a byte to specify them. Remember, Unicode has approximately 256,000 characters in it. How many code pages is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding from 20 years ago, that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been initially dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on reddit, stackoverflow, hackernews, etc., and look for fertile ground for it. I'm sorry you're not finding fertile ground here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.
Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this discussion has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know, or at the very least use, UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason about ideas at all.

Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.
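Nothing in the thread pins down a concrete layout, but the header described here might look something like the struct below; every field name, width, and array bound is my own guess for illustration, not part of any proposal.

```c
#include <stdint.h>

/* Illustrative guess at a multi-language string header: a list of the
 * languages used, plus index pairs bracketing each run of
 * single-language characters. The fixed array sizes are arbitrary. */
#define MAX_LANGS 8
#define MAX_RUNS  16

typedef struct {
    uint8_t  num_langs;             /* languages used in this string    */
    uint8_t  langs[MAX_LANGS];      /* one byte per language            */
    uint8_t  num_runs;              /* single-language runs             */
    uint16_t run_start[MAX_RUNS];   /* byte offset where a run begins   */
    uint16_t run_end[MAX_RUNS];     /* byte offset just past the run    */
    uint8_t  run_lang[MAX_RUNS];    /* index into langs[] for each run  */
} multilang_header;
```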

Please implement the simple C function strstr() with this simple scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct:
----------------------------------
#include <stddef.h>
#include <string.h>

char *strstr(const char *s1, const char *s2) {
   size_t len1 = strlen(s1);
   size_t len2 = strlen(s2);
   if (!len2)                        /* empty needle matches at the start */
       return (char *) s1;
   char c2 = *s2;                    /* first byte of the needle */
   while (len2 <= len1) {            /* stop once the needle can't fit */
       if (c2 == *s1)
           if (memcmp(s2, s1, len2) == 0)
               return (char *) s1;
       s1++;
       len1--;
   }
   return NULL;
}
----------------------------------
There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for.
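The run-skipping idea can be sketched as follows; the run table and function are hypothetical stand-ins for whatever the real header would provide, and the inner scan is a naive byte search for a needle known to be in a single language.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: the header gives, for each single-language run, its
 * language id and byte range. A search for a needle in language `lang`
 * can then skip every differently-tagged run without reading its bytes. */
typedef struct {
    unsigned char lang;   /* language id of this run (illustrative) */
    size_t start, end;    /* byte range [start, end) in the string  */
} lang_run;

const char *find_in_runs(const char *s, const char *needle, unsigned char lang,
                         const lang_run *runs, size_t nruns)
{
    size_t nlen = strlen(needle);
    for (size_t r = 0; r < nruns; r++) {
        if (runs[r].lang != lang)
            continue;                 /* whole run ruled out by the header */
        if (runs[r].end - runs[r].start < nlen)
            continue;                 /* run too short to contain the needle */
        for (size_t i = runs[r].start; i + nlen <= runs[r].end; i++)
            if (memcmp(s + i, needle, nlen) == 0)
                return s + i;
    }
    return NULL;
}
```

The point of the sketch is only the two `continue` lines: entire runs are rejected from the header alone, without touching their bytes.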

Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte of the UTF-8 haystack, including continuation bytes, to see if it might match the first letter of the search string, even though no continuation byte ever will. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I mentioned earlier in this thread, but then you've added more lines of code to your pretty yet simple function, and decoding overhead to every iteration of the while loop.
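That tweak can be sketched as a small change to the posted function. It assumes the needle is well-formed UTF-8, so its first byte is a lead byte and can never equal a 10xxxxxx continuation byte; the function name is mine.

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise strstr that skips UTF-8 continuation bytes: since a
 * well-formed needle starts with a lead byte, positions holding a
 * 10xxxxxx byte can never begin a match and need not be compared. */
char *strstr_skip_cont(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *)s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1 && memcmp(s2, s1, len2) == 0)
            return (char *)s1;
        s1++;
        len1--;
        /* extra per-byte test: hop over continuation bytes */
        while (len1 >= len2 && ((unsigned char)*s1 & 0xC0) == 0x80) {
            s1++;
            len1--;
        }
    }
    return NULL;
}
```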

My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8.

Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding: strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, i.e. app developers, and the speed and efficiency of the code. You are trading away the latter two for the former with this implementation. That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.

I suggest you make an attempt at writing strstr and post it. Code speaks louder than words.
