On Apr 23, 2013, at 07:01 , Marvin Humphrey <[email protected]> wrote:

> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them.  One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.

...

> There's extra memory cost to going that route, but it buys you some
> flexibility.

I'm a bit worried about the memory cost. If the underlying buffer is a
full-blown object, it will use at least three words of memory (including the
object header). The String object itself uses at least another five, and more
if we cache things like the hash sum or the length in code points. Considering
the additional malloc overhead for two objects and the buffer itself, this can
easily add up to ~100 bytes on a 64-bit system (unless we restrict string size
to 4GB). And all that for strings which in many cases are only 10-15
characters long!
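
To make the word counts above concrete, here is a rough sketch of one possible layout. All struct and field names (ObjHeader, BufferObj, StringObj, estimate_footprint) are hypothetical and only illustrate the arithmetic; the per-allocation malloc overhead is a parameter because it depends on the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-word object header: vtable pointer plus refcount. */
typedef struct {
    void   *vtable;
    size_t  refcount;
} ObjHeader;

/* Hypothetical shared buffer object: header + pointer to bytes = 3 words. */
typedef struct {
    ObjHeader  header;
    char      *bytes;
} BufferObj;

/* Hypothetical String: header + buffer ref + start pointer + size = 5 words. */
typedef struct {
    ObjHeader    header;
    BufferObj   *buf;
    const char  *ptr;
    size_t       size;
} StringObj;

/* Estimate total bytes for one string: two objects, the character data,
 * and allocator overhead for each of the three allocations. */
size_t estimate_footprint(size_t payload_bytes, size_t malloc_overhead) {
    size_t fixed = sizeof(BufferObj) + sizeof(StringObj) + 3 * malloc_overhead;
    assert(payload_bytes <= SIZE_MAX - fixed);  /* guard against overflow */
    return fixed + payload_bytes;
}
```

With 8-byte words the two structs alone take 24 + 40 = 64 bytes before the character data and the allocator overhead are counted.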

> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated.  This is hard.  Can we guarantee that every host
> string we wrap will be NUL-terminated?  I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.

Also, if we want to support substrings with a shared buffer, it's impossible to 
NUL-terminate them.
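
A small sketch of why this is so, using a hypothetical StrView type (pointer plus length) rather than any actual Clownfish struct: the byte that follows a substring still belongs to the parent string, so writing a NUL there would corrupt the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string view: points into a buffer owned by someone else. */
typedef struct {
    const char *ptr;
    size_t      size;
} StrView;

/* Create a substring that shares the parent's buffer; no bytes are copied. */
StrView view_substring(StrView s, size_t offset, size_t len) {
    assert(offset <= s.size && len <= s.size - offset);
    StrView sub = { s.ptr + offset, len };
    return sub;
}
```

For a parent view over "hello world", the substring covering "hello" is followed by a space in the shared buffer, not by a NUL.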

BTW, this problem isn't restricted to UTF-8. UTF-16 strings also have to be 
NUL-terminated if we want to pass them to the Windows file system API, for 
example.

> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand.  As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.

Or simply create a new string every time. Do we need NUL-terminated strings 
that often?
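
Creating a new string every time could be as simple as the following sketch. The function name to_cstr is hypothetical; the caller owns the result and must free() it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy `size` bytes starting at `ptr` into a freshly allocated,
 * NUL-terminated C string. Returns NULL if allocation fails. */
char *to_cstr(const char *ptr, size_t size) {
    assert(ptr != NULL);
    char *copy = malloc(size + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, ptr, size);
    copy[size] = '\0';
    return copy;
}
```

This costs one allocation and one copy per call, which only matters if NUL-terminated strings are requested on a hot path.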

>> How should CharBufs and Strings interact?
> 
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
> 
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.

It could be argued that strings should only be created via Yield_String(). 
Otherwise, strings would be mutable through the underlying buffer.
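
A minimal sketch of the "steal the buffer" idea, with hypothetical types and function names (CharBuf here is a plain struct, not the real Clownfish class). After the yield, the CharBuf no longer holds a pointer to the bytes, so nothing can mutate the string through it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mutable builder. */
typedef struct {
    char   *ptr;
    size_t  size;
    size_t  cap;
} CharBuf;

/* Hypothetical immutable string that owns its bytes. */
typedef struct {
    const char *ptr;
    size_t      size;
} ImmString;

/* Append bytes, growing the buffer as needed. Returns 0 on success,
 * -1 on allocation failure. Size overflow is ignored in this sketch. */
int charbuf_cat(CharBuf *cb, const char *bytes, size_t len) {
    assert(cb != NULL);
    if (len == 0) {
        return 0;
    }
    if (len > cb->cap - cb->size) {
        size_t new_cap = (cb->size + len) * 2;
        char *grown = realloc(cb->ptr, new_cap);
        if (grown == NULL) {
            return -1;
        }
        cb->ptr = grown;
        cb->cap = new_cap;
    }
    memcpy(cb->ptr + cb->size, bytes, len);
    cb->size += len;
    return 0;
}

/* Hand the buffer over to a new string and reset the builder to empty. */
ImmString charbuf_yield_string(CharBuf *cb) {
    assert(cb != NULL);
    ImmString s = { cb->ptr, cb->size };
    cb->ptr  = NULL;
    cb->size = 0;
    cb->cap  = 0;
    return s;
}
```

Further appends to the CharBuf allocate a fresh buffer, so the yielded string's contents stay fixed.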

Nick
