Re: String internals sketch

Andy Wingo Fri, 10 Mar 2017 08:09:13 -0800

Hi :)

On Fri 10 Mar 2017 16:31, David Kastrup <[email protected]> writes:


> a) Guile already uses two different internal representations: basically
> UCS-8 and UCS-32.  Adding more internal representations could be done
> using a number of tables indexed by the internal representation type,
> making string representations sort of a "plugin".

I think we probably want to avoid this if we can.  We gain a number of
efficiencies if we can be concrete.

Of course there are counterexamples in which specialization can help,
like the 20-some string kinds in V8, for example: cons strings, latin1
strings, utf-16 strings, external strings, slices, and the product of
all of those; but I am hesitant to take on this cost.  If we switched to
UTF-8 strings, I would like to use it as our only string representation.

Sure would be nice to have cons strings though!  (That would give O(1)
string-append.)
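
To make the O(1) claim concrete, here is a minimal sketch of a cons
string in C.  The names (str, str_leaf, str_append, str_flatten) are
hypothetical and are not Guile or V8 APIs; the sketch assumes leaf bytes
are borrowed and outlive the nodes, and it ignores GC and balancing.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cons-string node: either a flat leaf or a pair. */
typedef struct str {
    size_t len;               /* total length in bytes */
    const char *flat;         /* non-NULL for a leaf */
    const struct str *left;   /* used only when flat == NULL */
    const struct str *right;
} str;

/* Wrap existing bytes as a leaf; the bytes are borrowed, not copied. */
static str *str_leaf(const char *bytes, size_t len) {
    str *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->len = len;
    s->flat = bytes;
    s->left = s->right = NULL;
    return s;
}

/* O(1) append: allocate one node and copy no character data. */
static str *str_append(const str *a, const str *b) {
    str *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->len = a->len + b->len;
    s->flat = NULL;
    s->left = a;
    s->right = b;
    return s;
}

/* Copy the contents into out (room for s->len bytes is assumed).
   Returns a pointer just past the last byte written.  This is O(n),
   so the cost of append is deferred, not eliminated. */
static char *str_flatten(const str *s, char *out) {
    if (s->flat) {
        memcpy(out, s->flat, s->len);
        return out + s->len;
    }
    out = str_flatten(s->left, out);
    return str_flatten(s->right, out);
}
```

The tradeoff is that indexing into an unflattened cons string is no
longer a plain array access, which is part of the cost mentioned above.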

> b) Scheme, at least older dialects, have several O(1) guarantees.

R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
are reasonable cases to be made for O(log N) being a good tradeoff if
you get other benefits.

> c) Indexing is the most important thing one wants to be fast.  For an
> utf-8 internal representation, a lot is achieved if one caches both last
> index and last byte offset, preferably also string length as index and
> byte length.

Consider threads though :/ Caches get a bit complicated there.
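
For reference, a minimal sketch of the cached-index idea in C.  The
u8str type and the u8_seq_len and u8_offset functions are hypothetical,
the input is assumed to be valid UTF-8, and a backward seek simply
restarts from the beginning.  Note that the cache fields are written
without any synchronization, which is exactly the threading problem.

```c
#include <stddef.h>

/* Hypothetical UTF-8 string with a one-entry index cache. */
typedef struct {
    const unsigned char *bytes; /* assumed to be valid UTF-8 */
    size_t byte_len;
    size_t last_index;          /* cached character index */
    size_t last_offset;         /* byte offset of that character */
} u8str;

/* Length of a UTF-8 sequence, judged from its lead byte. */
static size_t u8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Return the byte offset of character `index`, or (size_t)-1 if the
   index is out of range.  Scans forward from the cached position when
   possible, otherwise from the start of the string. */
static size_t u8_offset(u8str *s, size_t index) {
    size_t i = 0, off = 0;
    if (index >= s->last_index) {
        i = s->last_index;
        off = s->last_offset;
    }
    while (i < index && off < s->byte_len) {
        off += u8_seq_len(s->bytes[off]);
        i++;
    }
    if (off >= s->byte_len) return (size_t)-1;
    /* Unsynchronized update: two threads can interleave these writes
       and leave a mismatched index/offset pair behind. */
    s->last_index = i;
    s->last_offset = off;
    return off;
}
```

Sequential forward access costs one sequence step per call with this
cache; random access is still O(n) in the worst case.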

> d) a bad complication is write access to strings, for example with
>
>  -- Scheme Procedure: string-map! proc s [start [end]]
>  -- C Function: scm_string_map_x (proc, s, start, end)

TBH I wouldn't worry too much about this function in particular; you
could map characters into a vector and then write those characters
back to the string.  Most modern languages of course have read-only
strings, and destructive operations on strings are mostly used when
filling buffers.
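
A minimal sketch of that approach in C, assuming valid UTF-8 input and
a proc that returns valid code points.  All names here (u8_decode,
u8_encode, u8_map, a_to_a_umlaut) are hypothetical; this is not how
scm_string_map_x is implemented.  The result goes to a fresh buffer
because the mapped string may need more or fewer bytes than the input.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode one code point from valid UTF-8; return bytes consumed. */
static size_t u8_decode(const unsigned char *p, uint32_t *cp) {
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
            | (p[2] & 0x3Fu);
        return 3;
    }
    *cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
        | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    return 4;
}

/* Encode one code point (<= 0x10FFFF assumed); return bytes written. */
static size_t u8_encode(uint32_t cp, unsigned char *out) {
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Example proc: maps 'a' (1 byte) to U+00E4 (2 bytes in UTF-8). */
static uint32_t a_to_a_umlaut(uint32_t cp) {
    return cp == 'a' ? 0xE4 : cp;
}

/* Decode into a vector, map each element, encode into a new buffer.
   Returns a malloc'd buffer (caller frees) and stores its length. */
static unsigned char *u8_map(const unsigned char *in, size_t in_len,
                             uint32_t (*proc)(uint32_t), size_t *out_len) {
    /* At most in_len code points, each at most 4 bytes when encoded. */
    uint32_t *vec = malloc((in_len ? in_len : 1) * sizeof *vec);
    unsigned char *out = malloc(in_len ? in_len * 4 : 1);
    size_t n = 0, off = 0, o = 0;
    if (!vec || !out) { free(vec); free(out); return NULL; }
    while (off < in_len) off += u8_decode(in + off, &vec[n++]);
    for (size_t i = 0; i < n; i++) vec[i] = proc(vec[i]);
    for (size_t i = 0; i < n; i++) o += u8_encode(vec[i], out + o);
    free(vec);
    *out_len = o;
    return out;
}
```

Mapping the 3-byte string "aba" with a_to_a_umlaut yields 5 bytes, so a
true in-place update would have to shift or reallocate the tail.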

That said, this point:

> The current string character can gain a longer or a shorter byte length
> in this process.

Is especially gnarly in the threaded case; string updates are no longer
atomic.  One thread mutating a string might actually corrupt another
thread's view of it.  Right now on x86, updates are entirely atomic; on other
processors that need barriers, the worst that could happen is that
another thread could fail to read an update.  We'd have to re-add the
string write mutex, which would be a bit sad :)

> So it should provide a _large_ opportunity for the sakes of applications
> with profiles akin to Emacs or LilyPond.

I'm sympathetic :) Lots of details to get right (or wrong!) though.
WDYT about the threading aspects?

Andy
