Mike Gran <spk...@yahoo.com> writes: >> From:Mark H Weaver <m...@netris.org> >> Despite the similarity of these two representations, they are >> sufficiently different that they cannot be handled by the same machine >> code. That means you must either implement multiple inner loops, one >> for each combination of string parameter representations, or else you >> must dispatch on the string representation within the inner loop. On >> modern architectures, wrongly predicted conditional branches are very >> expensive. > > Keep in mind that the UTF-8 forward iterator operation has conditional > branches. Merely the act of advancing from one character to another > could take one of four paths, or more if you include the possibility > of invalid UTF-8 sequences.
Yes, but for many common operations including simple string comparison, substring search, regexp search, and parsing, there is no need to know where the character boundaries are. These algorithms can be done bytewise on UTF-8. > Well, we covered O(1) vs O(n). To make UTF-8 O(1), you need to store > additional indexing information of some sort. There are various schemes, > but, depending the the scheme, you lose some of memory advantage of UTF-8 > vs UTF-32. You can likely to better than UTF-32, though. I would prefer to either let our accessors be O(n), or else to create the index lazily, i.e. on the first usage of string-ref or string-set! In such a scheme, very few strings would include indices, and thus the overhead would be minimal. Anyway, the index overhead can be made arbitrarily small by increasing the chunk size. It is a classic time-space trade-off here. The chunk size could be made larger over the years, as usage of string-ref and string-set! become less common, and eventually the index stuff could be removed entirely. > So, one of the supposed benefits of UTF-8 strings for Guile is that you > can more easily pass information back and forth between Guile and external > libraries or programs. But in that case, when UTF-8 strings are received > by a UTF-8-enabled Guile from an external source, > you have to deal with both its O(1)-ness The indexing can be added lazily (and avoided more often than not). > the validity of its UTF-8 sequences, and the validity of the > codepoints that the UTF-8 sequences decode to (if the string is passed > back from program or library that you don't completely trust.) Yes, if we don't trust the library, then we'd have to validate the UTF-8. However, in many cases (e.g. when using libunistring) we could trust them. > So the question becomes *when* you'd deal with those questions. The best > policy would be to deal with them at the point a UTF-8 string enters your > application. When a string enters your application, you should scrub it > for invalid sequences and codepoints, throwing errors or replacing > bad characters with the Unicode replacement character and whatnot. I agree, we should make sure that all internal UTF-8 strings are valid. As you rightly point out, postponing the validation would cause many problems. > And then there is a code-correctness argument against UTF-8. It is just > too easy to confuse a string's codepoint count with its memory size, and > it is just too easy to confuse iteration over bytes with iteration over > characters. True, but there are comparable code-correctness problems with the current scheme. If we use UTF-8 uniformly, we can efficiently use external libraries such as libunistring for many tasks, and avoid writing that code ourselves. In order to make our current scheme efficient, we must not only write and import much of the code from libunistring and other libraries, but we will in many cases need to write and maintain multiple versions of those routines, to handle different combinations of string representations. The reason I am still arguing this point is because I have looked seriously at what I would need to do to (A) fix our i18n problems and (B) make the code efficient. I very much want to fix these things, but the pain of trying to do this with our current scheme is too much for me to bear. I shouldn't have to rewrite libunistring, and I shouldn't have to write 3 or 4 different variants of each procedure that takes two string parameters. Mark