On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string. But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.
If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length. It may be more work to write library code to
handle my encoding, but efficiency and ease of use are
paramount.
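To make the efficiency point concrete: in a predominantly single-byte encoding, the i-th character of a single-language string is simply s[i], while UTF-8 forces a scan. Here's a rough C sketch; the function name is mine and purely illustrative:

```c
#include <stddef.h>

/* Illustrative sketch only: finding the i-th character in UTF-8 means
 * walking the string and skipping continuation bytes (bit pattern
 * 10xxxxxx), an O(n) scan.  In a single-byte encoding the same lookup
 * is just s[i], an O(1) index. */
const char *utf8_index(const char *s, size_t i) {
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) { /* lead byte: starts a char */
            if (i == 0)
                return s;
            i--;
        }
    }
    return NULL;  /* string has fewer than i+1 characters */
}
```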
So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,
Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a
two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length. The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.
Yes, there will be conditionals, just as there are several
conditionals in Phobos depending on whether a language supports
uppercase or not. The question is whether the conditionals for
a single-byte encoding will execute faster than decoding every
UTF-8 character. This is a matter of engineering judgement; I
see no reason why you think decoding every UTF-8 character is
faster.
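To illustrate the kind of conditional in question: even just counting characters needs a per-byte test in UTF-8, where a single-byte encoding would get the count from strlen() alone. A hedged sketch (the function name is mine):

```c
#include <stddef.h>

/* Sketch: counting characters in UTF-8 requires classifying every byte
 * as lead (anything not matching 10xxxxxx) or continuation; in a
 * single-byte encoding the count is simply strlen(s). */
size_t utf8_strlen(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* count only lead bytes */
            n++;
    return n;
}
```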
then you claim I don't handle
these languages. This kind of blatant contradiction within
two posts can only
be called... trolling!
You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned
to use two bytes to encode Chinese? I'm not sure why you're
continuing along this contradictory path.
I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented. I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.
Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets. That leaves space for another
100 or so new scripts. Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't. I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.
I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, one that is really only useful when
streaming text, which nobody does today. There may have been a
time when ASCII compatibility was paramount, when nobody cared
about internationalization and almost all libraries only took
ASCII input: that is not the case today.
I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.
Let me admit in return that I might be completely wrong about
my single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong,
it's possible we've all missed something salient, some
deal-breaker. As I said before, I'm not proposing that D
"switch over." I was simply asking people who know or at the
very least use UTF-8 more than most, as a result of employing
one of the few languages with Unicode support baked in, why
they think UTF-8 is a good idea.
I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding. Since
nobody has been able to point out a reason why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation. I may write something up after
implementation: most people don't care about ideas, only
results, to the point where almost nobody can reason at all
about ideas.
Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by
"handwaving and assumptions." Some people can reason about
such possible encodings, even in the incomplete form I've
sketched out, without having implemented them, if they know
what they're doing.
On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would
contain a single byte for every language used in the string,
along with multiple
index bytes to signify the start and finish of every run of
single-language
characters in the string. So, a list of languages and a list
of pure
single-language substrings.
Please implement the simple C function strstr() with this
simple scheme, and
post it here.
http://www.digitalmars.com/rtl/string.html#strstr
I'll go first. Here's a simple UTF-8 version in C. It's not
the fastest way to do it, but at least it is correct:
----------------------------------
#include <stddef.h>   /* size_t, NULL */
#include <string.h>   /* strlen, memcmp */

char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
----------------------------------
There is no question that a UTF-8 implementation of strstr can
be simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese. But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient. For
instance, because the header tells you where all the language
substrings are, you can potentially rule out searching vast
swathes of the string: runs in a different language, or runs
too short to contain the search string, can be skipped
entirely.
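To sketch what I mean by header-driven skipping, assume a simplified header of (language, start, end) runs. All type, field, and function names here are hypothetical, and this is nowhere near a finished spec (it doesn't even handle a needle spanning two adjacent runs of the same language):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical header entry: one single-language run of the string. */
typedef struct {
    unsigned char lang;   /* one-byte language code (hypothetical) */
    size_t start, end;    /* byte offsets [start, end) into the text */
} Run;

/* Hypothetical multi-language string: header (runs) plus raw bytes. */
typedef struct {
    size_t nruns;
    const Run *runs;
    const char *text;     /* single-byte-encoded bytes, no decoding needed */
} MLString;

/* Search only runs whose language matches the single-language needle:
 * runs in other languages, or runs too short, are skipped wholesale. */
const char *ml_strstr(const MLString *hay, unsigned char needle_lang,
                      const char *needle, size_t needle_len) {
    for (size_t r = 0; r < hay->nruns; r++) {
        const Run *run = &hay->runs[r];
        if (run->lang != needle_lang)
            continue;                     /* whole run ruled out by header */
        if (run->end - run->start < needle_len)
            continue;                     /* run too short to hold needle */
        for (size_t i = run->start; i + needle_len <= run->end; i++)
            if (memcmp(hay->text + i, needle, needle_len) == 0)
                return hay->text + i;
    }
    return NULL;
}
```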
Even if you're searching a single-language string, which won't
have those speedups, your naive UTF-8 implementation checks
every byte, including continuation bytes, to see if it might
match the first letter of the search string, even though no
continuation byte can match. You can avoid this by partially
decoding the leading bytes of UTF-8 characters and skipping
over continuation bytes, as I've mentioned earlier in this
thread, but then you've added more lines of code to your pretty
yet simple function and added decoding overhead to every
iteration of the while loop.
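The continuation-byte optimization I'm describing could be bolted onto Walter's version like this; a sketch only, and the function is renamed to avoid shadowing the library strstr:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the optimization described above: only lead bytes (anything
 * not matching 10xxxxxx) can start a UTF-8 character, so continuation
 * bytes are rejected before any comparison against the needle. */
char *utf8_strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (((unsigned char)*s1 & 0xC0) != 0x80  /* skip continuation bytes */
                && c2 == *s1
                && memcmp(s2, s1, len2) == 0)
            return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
```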
My single-byte encoding has none of these problems; in fact,
it's much faster and uses less memory for the same function,
while providing additional speedups, from the header, that are
not available to UTF-8.
Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier
is a low priority for any good format. The primary goals are
ease of use for library consumers, ie app developers, and speed
and efficiency of the code. You are trading away the latter
two for the former with this implementation. That is not a
good tradeoff.
Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy
disks to arrive with expensive library code. It is not a good
trade today.
I服了u ("you win, I give up"); I'm starting to think your name
means you're joking?