On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> >And just how exactly does that help with slicing? If anything, it
> >makes slicing way hairier and error-prone than UTF-8. In fact, this
> >one point alone already defeated any performance gains you may have
> >had with a single-byte encoding. Now you can't do *any* slicing at
> >all without convoluted algorithms to determine what encoding is where
> >at the endpoints of your slice, and the resulting slice must have new
> >headers to indicate the start/end of every different-language
> >substring. By the time you're done with all that, you're going way
> >slower than processing UTF-8.
>
> There are no convoluted algorithms, it's a simple check if the
> string contains any two-byte encodings, a check which can be done
> once and cached.

IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s). That has nothing to do with two-byte
encodings. So, please show us the code: given a string containing, say,
English and French substrings, what will the header look like? And
what's the algorithm to take a slice of such a string?


> If it's single-byte all the way through, no problems whatsoever with
> slicing.

Huh?! How are there no problems with slicing? Let's say you have a
string that contains both English and French. According to your scheme,
you'll have some kind of header format that lets you say bytes 0-123 are
English, bytes 124-129 are French, and bytes 130-200 are English. Now
let's say I want a substring from 120 to 125. How would this be done?
And what about if I want a substring from 120 to 140? Or 126 to 130?
What if the string contains several runs of French? Please show us the
code.


> If there are two-byte languages included, the slice function will have
> to do a little arithmetic calculation before slicing. You will also
> need a few arithmetic ops to create the new header for the slice. The
> point is that these operations will be much faster than decoding every
> code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be
faster than manipulating UTF-8. What if I have an English text that
contains quotations of Chinese, French, and Greek snippets? Math
symbols? Please show us (1) how such a string should be encoded under
your scheme, and (2) the code that will slice such a string in an
efficient way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as
rare, consider a technical math paper that cites the work of Chinese and
French authors -- a rather common thing these days.
You'd need the extra characters just to be able to cite their names,
even if none of the actual Chinese or French is quoted verbatim. Greek
in general is used all over math anyway, since for whatever reason
mathematicians just love Greek symbols, so it pretty much needs to be
included by default.)


> >Again I say, I'm not 100% sold on UTF-8, but what you're proposing
> >here is far worse.
> Well, I'm glad you realize some problems with UTF-8, :) even if you
> dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of making
general statements about the superiority of your scheme, you might want
to show us the actual code. So far, I haven't seen anything that
convinces me that your scheme is any better. In fact, from what I can
see, it's a lot worse, and you're just evading pointed questions about
how to address those problems. Maybe that's a wrong perception, but not
having any actual code to look at, I'm having a hard time believing your
claims. Right now I'm leaning towards agreeing with Walter that you're
just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop
responding, as we're obviously not on the same page and this discussion
isn't getting anywhere.


T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
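
To make the slicing question in the message above concrete, here is a
minimal sketch of what re-slicing a run-based header might involve. It is
only one possible reading of the scheme under discussion, not code anyone
in the thread has posted. The header layout (a list of `(start, end,
language)` runs with an exclusive end), the name `slice_runs`, and the
language tags are all assumptions made for illustration, and the sketch
covers the single-byte case only.

```python
from typing import List, Tuple

# Hypothetical header entry: (start, end, language), end exclusive.
Run = Tuple[int, int, str]


def slice_runs(runs: List[Run], lo: int, hi: int) -> List[Run]:
    """Build the header for the byte slice [lo, hi), re-based to offset 0.

    Assumes one byte per character in every run, so byte offsets and
    character offsets coincide.
    """
    out: List[Run] = []
    for start, end, lang in runs:
        # Clip each run to the requested window.
        s, e = max(start, lo), min(end, hi)
        if s < e:
            # Keep the overlapping part, shifted so the slice starts at 0.
            out.append((s - lo, e - lo, lang))
    return out


# The example from the message: bytes 0-123 English, 124-129 French,
# 130-200 English (inclusive ranges, written here with exclusive ends).
runs = [(0, 124, "en"), (124, 130, "fr"), (130, 201, "en")]

print(slice_runs(runs, 120, 125))
print(slice_runs(runs, 120, 140))
print(slice_runs(runs, 126, 130))
```

Even in this simplest form, every slice has to walk (or binary-search) the
run list and allocate a new header, in addition to slicing the bytes
themselves. Mixing in two-byte runs would also require translating between
character and byte offsets, which this sketch does not attempt.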
