Hi :) On Fri 10 Mar 2017 16:31, David Kastrup <[email protected]> writes:
> a) Guile already uses two different internal representations: basically > UCS-8 and UCS-32. Adding more internal representations could be done > using a number of tables indexed by the internal representation type, > making string representations sort of a "plugin". I think we probably want to avoid this if we can. We gain a number of efficiencies if we can be concrete. Of course there are counterexamples in which specialization can help, like the 20-some string kinds in V8, for example: cons strings, latin1 strings, utf-16 strings, external strings, slices, and the product of all of those; but I am hesitant to take on this cost. If we switched to UTF-8 strings, I would like to use it as our only string representation. Sure would be nice to have cons strings though! (That would give O(1) string-append.) > b) Scheme, at least older dialects, have several O(1) guarantees. R7RS seems to have relaxed this FWIW. O(1) is great of course but there are reasonable cases to be made for O(log N) being a good tradeoff if you get other benefits. > c) Indexing is the most important thing one wants to be fast. For an > utf-8 internal representation, a lot is achieved if one caches both last > index and last byte offset, preferably also string length as index and > byte length. Consider threads though :/ Caches get a bit complicated there. > d) a bad complication is write access to strings, for example with > > -- Scheme Procedure: string-map! proc s [start [end]] > -- C Function: scm_string_map_x (proc, s, start, end) TBH I wouldn't worry too much about this function in particular; you could map characters into to a vector and then write those characters back to the string. Most modern languages of course have read-only strings, and destructive operations on strings are mostly used when filling buffers. That said, this point: > The current string character can gain a longer or a shorter byte length > in this process. Is especially gnarly in the threaded case; string updates are no longer atomic. One thread mutating a string might actually corrupt another thread. Right now on x86, updates are entirely atomic; on other processors that need barriers, the worst that could happen is that another thread could fail to read an update. We'd have to re-add the string write mutex, which would be a bit sad :) > So it should provide a _large_ opportunity for the sakes of applications > with profiles akin to Emacs or LilyPond. I'm sympathetic :) Lots of details to get right (or wrong!) though. WDYT about the threading aspects? Andy
