On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 having every non-English language string be variable-length. It may be more work to write library code to handle my encoding, but efficiency and ease of use are paramount.

So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement, I see no reason why you think decoding every UTF-8 character is faster.
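To make that engineering question concrete, here is a minimal sketch; it is entirely hypothetical, since the proposed encoding has no spec, and the function names are mine. A per-string width flag hoists the one-versus-two-byte conditional out of the character loop, while UTF-8 must branch on every character's lead byte.

```c
#include <stddef.h>

/* Hypothetical scheme from this thread: a header flag says whether the
 * whole string is one or two bytes per character, so the conditional
 * runs once per string rather than once per character. */
size_t count_chars_flagged(const unsigned char *s, size_t nbytes, int two_byte)
{
    (void)s;                          /* length alone determines the count */
    return two_byte ? nbytes / 2 : nbytes;
}

/* UTF-8: each character's width must be discovered from its lead byte,
 * so the branch executes for every character in the string. */
size_t count_chars_utf8(const unsigned char *s, size_t nbytes)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; n++) {
        unsigned char c = s[i];
        if (c < 0x80)                 i += 1;   /* ASCII */
        else if ((c & 0xE0) == 0xC0)  i += 2;   /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)  i += 3;   /* 3-byte sequence */
        else                          i += 4;   /* 4-byte sequence */
    }
    return n;
}
```

Whether the per-character branch actually costs more than a per-string header lookup is exactly the judgement call being argued; the sketch only shows where the conditionals live.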

then you claim I don't handle these languages. This kind of blatant contradiction within two posts can only be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path.

I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings - you can't even have a byte to specify them. Remember, Unicode has approximately 256,000 characters in it. How many code pages is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding from 20 years ago, that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been initially dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on reddit, stackoverflow, hackernews, etc., and look for fertile ground for it. I'm sorry you're not finding fertile ground here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.
Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this discussion has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know, or at the very least use, UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason about ideas at all.

Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.
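Nothing in the thread pins down a concrete layout, but the header described here might look something like the struct below; every field name, width, and array bound is my own guess for illustration, not part of any proposal.

```c
#include <stdint.h>

/* Illustrative guess at a multi-language string header: a list of the
 * languages used, plus index pairs bracketing each run of
 * single-language characters. The fixed array sizes are arbitrary. */
#define MAX_LANGS 8
#define MAX_RUNS  16

typedef struct {
    uint8_t  num_langs;             /* languages used in this string    */
    uint8_t  langs[MAX_LANGS];      /* one byte per language            */
    uint8_t  num_runs;              /* single-language runs             */
    uint16_t run_start[MAX_RUNS];   /* byte offset where a run begins   */
    uint16_t run_end[MAX_RUNS];     /* byte offset just past the run    */
    uint8_t  run_lang[MAX_RUNS];    /* index into langs[] for each run  */
} multilang_header;
```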

Please implement the simple C function strstr() with this simple scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct:
----------------------------------
#include <stddef.h>
#include <string.h>

char *strstr(const char *s1, const char *s2) {
   size_t len1 = strlen(s1);
   size_t len2 = strlen(s2);
   if (!len2)                        /* empty needle matches at the start */
       return (char *) s1;
   char c2 = *s2;                    /* first byte of the needle */
   while (len2 <= len1) {            /* stop once the needle can't fit */
       if (c2 == *s1)
           if (memcmp(s2, s1, len2) == 0)
               return (char *) s1;
       s1++;
       len1--;
   }
   return NULL;
}
----------------------------------
There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for.
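The run-skipping idea can be sketched as follows; the run table and function are hypothetical stand-ins for whatever the real header would provide, and the inner scan is a naive byte search for a needle known to be in a single language.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: the header gives, for each single-language run, its
 * language id and byte range. A search for a needle in language `lang`
 * can then skip every differently-tagged run without reading its bytes. */
typedef struct {
    unsigned char lang;   /* language id of this run (illustrative) */
    size_t start, end;    /* byte range [start, end) in the string  */
} lang_run;

const char *find_in_runs(const char *s, const char *needle, unsigned char lang,
                         const lang_run *runs, size_t nruns)
{
    size_t nlen = strlen(needle);
    for (size_t r = 0; r < nruns; r++) {
        if (runs[r].lang != lang)
            continue;                 /* whole run ruled out by the header */
        if (runs[r].end - runs[r].start < nlen)
            continue;                 /* run too short to contain the needle */
        for (size_t i = runs[r].start; i + nlen <= runs[r].end; i++)
            if (memcmp(s + i, needle, nlen) == 0)
                return s + i;
    }
    return NULL;
}
```

The point of the sketch is only the two `continue` lines: entire runs are rejected from the header alone, without touching their bytes.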

Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte of the UTF-8 haystack, including continuation bytes, to see if it might match the first letter of the search string, even though no continuation byte ever will. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I mentioned earlier in this thread, but then you've added more lines of code to your pretty yet simple function, and decoding overhead to every iteration of the while loop.
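That tweak can be sketched as a small change to the posted function. It assumes the needle is well-formed UTF-8, so its first byte is a lead byte and can never equal a 10xxxxxx continuation byte; the function name is mine.

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise strstr that skips UTF-8 continuation bytes: since a
 * well-formed needle starts with a lead byte, positions holding a
 * 10xxxxxx byte can never begin a match and need not be compared. */
char *strstr_skip_cont(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *)s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1 && memcmp(s2, s1, len2) == 0)
            return (char *)s1;
        s1++;
        len1--;
        /* extra per-byte test: hop over continuation bytes */
        while (len1 >= len2 && ((unsigned char)*s1 & 0xC0) == 0x80) {
            s1++;
            len1--;
        }
    }
    return NULL;
}
```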

My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8.

Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding: strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, i.e. app developers, and the speed and efficiency of the code. You are trading away the latter two for the former with this implementation. That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.

I suggest you make an attempt at writing strstr and post it. Code speaks louder than words.
