Re: O(1) accessors for UTF-8 backed strings

Mark H Weaver Tue, 15 Mar 2011 08:51:47 -0700

Alex Shinn <alexsh...@gmail.com> wrote:
> On Sun, Mar 13, 2011 at 1:05 PM, Mark H Weaver <m...@netris.org> wrote:
>> I just realized that it is possible to implement O(1) accessors for
>> UTF-8 backed strings.
>
> It's possible with several approaches, but not necessarily worth it:
>
> http://trac.sacrideo.us/wg/wiki/StringRepresentations


Alex, can you please clarify your position?  I fear that readers of your
message might assume that you are against my proposal to store strings
internally in UTF-8.  Having read the text that you referenced above, I
suspect that you are in favor of using UTF-8 with O(n) string accessors.

For those who may not be familiar with the special properties of UTF-8,
please read at least the section on "Common Algorithms and Usage
Patterns" near the end of the text Alex referenced.  In summary, many
operations on UTF-8 such as substring searches, regexp searches, and
parsing can be done one byte at a time, using the same inner loop that
would be used for ASCII or Latin-1.  Also, although it is not mentioned
there, even simple string comparisons (done lexicographically by code
point) can be done byte-wise on UTF-8.
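
As a rough illustration of the two claims above, here is a minimal
sketch in Python (chosen only for brevity; it is not Guile code).  The
helper names utf8_less and utf8_find are hypothetical, invented for
this example.  It assumes the inputs are valid Unicode strings with no
lone surrogates, and it compares the byte-wise results against
Python's own code point ordering.

```python
def utf8_less(a: str, b: str) -> bool:
    """Order two strings by comparing only their UTF-8 bytes."""
    return a.encode("utf-8") < b.encode("utf-8")


def utf8_find(haystack: str, needle: str) -> int:
    """Byte-wise substring search; returns a BYTE offset, or -1."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


# Samples mixing 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["", "a", "z", "aé", "a€", "é", "ÿ", "λ", "€", "日本", "😀"]

# Byte-wise ordering agrees with Python's code point ordering of str.
for a in samples:
    for b in samples:
        assert utf8_less(a, b) == (a < b)

# A byte-wise match always starts on a character boundary, so the
# prefix before the match decodes cleanly and gives the character index.
text = "naïve λ-calculus"
byte_offset = utf8_find(text, "λ")
char_index = len(text.encode("utf-8")[:byte_offset].decode("utf-8"))
assert text[char_index] == "λ"
```

Note that the search returns a byte offset, not a character index.
Code that keeps working in byte offsets never needs the conversion
shown in the last few lines.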

I'd also like to point out that the R6RS is the only relevant standard
that mandates O(1) string accessors.  The R5RS did not require this, and
WG1 for the R7RS has already voted against this requirement.

  http://trac.sacrideo.us/wg/ticket/27

I'll write more on this later.

    Mark
