Re: Using libunistring for string comparisons et al

Mark H Weaver Tue, 15 Mar 2011 16:13:22 -0700

Mike Gran <spk...@yahoo.com> writes:
>> From:Mark H Weaver <m...@netris.org>
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code.  That means you must either implement multiple inner loops, one
>> for each combination of string parameter representations, or else you
>> must dispatch on the string representation within the inner loop.  On
>> modern architectures, wrongly predicted conditional branches are very
>> expensive.
>
> Keep in mind that the UTF-8 forward iterator operation has conditional
> branches.  Merely the act of advancing from one character to another
> could take one of four paths, or more if you include the possibility
> of invalid UTF-8 sequences.


Yes, but for many common operations including simple string comparison,
substring search, regexp search, and parsing, there is no need to know
where the character boundaries are.  These algorithms can be done
bytewise on UTF-8.

> Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
> additional indexing information of some sort.  There are various schemes,
> but, depending the the scheme, you lose some of memory advantage of UTF-8
> vs UTF-32.  You can likely to better than UTF-32, though.

I would prefer to either let our accessors be O(n), or else to create
the index lazily, i.e. on the first usage of string-ref or string-set!
In such a scheme, very few strings would include indices, and thus the
overhead would be minimal.

Anyway, the index overhead can be made arbitrarily small by increasing
the chunk size.  It is a classic time-space trade-off here.  The chunk
size could be made larger over the years, as usage of string-ref and
string-set! become less common, and eventually the index stuff could be
removed entirely.

> So, one of the supposed benefits of UTF-8 strings for Guile is that you
> can more easily pass information back and forth between Guile and external
> libraries or programs.  But in that case, when UTF-8 strings are received
> by a UTF-8-enabled Guile from an external source,
> you have to deal with both its O(1)-ness

The indexing can be added lazily (and avoided more often than not).

> the validity of its UTF-8 sequences, and the validity of the
> codepoints that the UTF-8 sequences decode to (if the string is passed
> back from program or library that you don't completely trust.)

Yes, if we don't trust the library, then we'd have to validate the
UTF-8.  However, in many cases (e.g. when using libunistring) we could
trust them.

> So the question becomes *when* you'd deal with those questions.  The best
> policy would be to deal with them at the point a UTF-8 string enters your
> application.  When a string enters your application, you should scrub it
> for invalid sequences and codepoints, throwing errors or replacing
> bad characters with the Unicode replacement character and whatnot.

I agree, we should make sure that all internal UTF-8 strings are valid.
As you rightly point out, postponing the validation would cause many
problems.

> And then there is a code-correctness argument against UTF-8.  It is just
> too easy to confuse a string's codepoint count with its memory size, and
> it is just too easy to confuse iteration over bytes with iteration over
> characters.

True, but there are comparable code-correctness problems with the
current scheme.  If we use UTF-8 uniformly, we can efficiently use
external libraries such as libunistring for many tasks, and avoid
writing that code ourselves.

In order to make our current scheme efficient, we must not only write
and import much of the code from libunistring and other libraries, but
we will in many cases need to write and maintain multiple versions of
those routines, to handle different combinations of string
representations.

The reason I am still arguing this point is because I have looked
seriously at what I would need to do to (A) fix our i18n problems and
(B) make the code efficient.  I very much want to fix these things,
but the pain of trying to do this with our current scheme is too much
for me to bear.  I shouldn't have to rewrite libunistring, and I
shouldn't have to write 3 or 4 different variants of each procedure
that takes two string parameters.

      Mark

Re: Using libunistring for string comparisons et al

Reply via email to