Alex Shinn <alexsh...@gmail.com> wrote:
> On Sun, Mar 13, 2011 at 1:05 PM, Mark H Weaver <m...@netris.org> wrote:
>> I just realized that it is possible to implement O(1) accessors for
>> UTF-8 backed strings.
>
> It's possible with several approaches, but not necessarily worth it:
>
>   http://trac.sacrideo.us/wg/wiki/StringRepresentations

Alex, can you please clarify your position? I fear that readers of your
message might assume that you are against my proposal to store strings
internally in UTF-8. Having read the text that you referenced above, I
suspect that you are in favor of using UTF-8 with O(n) string accessors.

For those who may not be familiar with the special properties of UTF-8,
please read at least the section on "Common Algorithms and Usage
Patterns" near the end of the text Alex referenced. In summary, many
operations on UTF-8, such as substring searches, regexp searches, and
parsing, can be done one byte at a time, using the same inner loop that
would be used for ASCII or Latin-1. Also, although it is not mentioned
there, even simple string comparisons (done lexicographically by code
point) can be done byte-wise on UTF-8.

I'd also like to point out that the R6RS is the only relevant standard
that mandates O(1) string accessors. The R5RS did not require this, and
WG1 for the R7RS has already voted against this requirement.

  http://trac.sacrideo.us/wg/ticket/27

I'll write more on this later.

    Mark
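
As a footnote to the claims above about byte-wise comparison and substring search, here is a minimal sketch in Python (not Scheme, and not taken from any implementation discussed in this thread). The function names and the sample strings are made up for illustration. It only checks, on a handful of strings, that comparing UTF-8 encodings byte by byte gives the same ordering as comparing by code point, and that a plain byte search locates a non-ASCII needle.

```python
from itertools import product


def bytewise_compare(a: str, b: str) -> int:
    """Compare two strings by the raw bytes of their UTF-8 encodings.

    Returns -1, 0, or 1, like a three-way comparison.
    """
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


def codepoint_compare(a: str, b: str) -> int:
    """Compare two strings lexicographically by code point."""
    ca, cb = [ord(c) for c in a], [ord(c) for c in b]
    return (ca > cb) - (ca < cb)


def bytewise_find(haystack: str, needle: str) -> int:
    """Return the byte offset of needle in haystack's UTF-8 encoding, or -1.

    The search itself is an ordinary byte search; nothing in it is
    aware of multi-byte sequences.
    """
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


# Illustrative samples covering 1-, 2-, 3-, and 4-byte encodings.
SAMPLES = ["", "a", "ab", "z", "\u00e9", "\u20ac", "\uffff",
           "\U00010000", "\U0001d11e"]

if __name__ == "__main__":
    agree = all(
        bytewise_compare(a, b) == codepoint_compare(a, b)
        for a, b in product(SAMPLES, repeat=2)
    )
    print("byte order matches code point order:", agree)
    print("byte offset:", bytewise_find("na\u00efve caf\u00e9", "caf\u00e9"))
```

Note that `bytewise_find` returns a byte offset, not a character index. Converting one to the other is where the O(n) cost of string accessors on UTF-8 shows up.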