On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer <[email protected]> wrote:
> Maybe we should start to flesh out the design of immutable Strings. Do you
> have a concrete plan already?
To get started, we could simply duplicate CharBuf's implementation and strip
out the mutability. :)
The unusual requirement of Clownfish Strings is that they have to wrap host
strings when bridging the host/C border. But it seems to me that this will
always mean borrowing the host string's internal character array for use with
a stack-allocated `const ZombieString*`.
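Roughly what I have in mind for the borrowing, as a sketch only: the
HostStrView type and both functions below are hypothetical names made up for
illustration, not the actual ZombieString API. The wrapper lives on the
caller's stack and never owns or copies the host's bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a stack-allocated wrapper around a host string.
 * It borrows the host's character array; it never owns or frees it. */
typedef struct HostStrView {
    const char *ptr;   /* borrowed from the host string */
    size_t      size;  /* length in bytes */
} HostStrView;

/* Wrap a host buffer without copying. The host string must outlive the
 * returned view. */
HostStrView
HostStrView_wrap(const char *host_bytes, size_t host_len) {
    assert(host_bytes != NULL || host_len == 0);
    HostStrView view = { host_bytes, host_len };
    return view;
}

/* Compare the borrowed bytes against another UTF-8 byte sequence. */
int
HostStrView_equals_utf8(const HostStrView *self, const char *utf8,
                        size_t len) {
    return self->size == len
           && (len == 0 || memcmp(self->ptr, utf8, len) == 0);
}
```

The view is only valid for the duration of the call across the border, which
is why it can stay on the stack.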
Looking forward...
There are only so many ways to implement the "immutable String class" design
pattern. :) See Python, Ruby Symbol, various implementations of Java
String, C#, etc.
The first question is how to handle the internal buffer. CharBufs need to own
their buffers; immutable Strings do not.
Multiple immutable String objects can share a single contiguous buffer, though
the buffer must outlive all of them. One possible implementation is to wrap
the buffer in an object which has a refcount or otherwise fits into the GC
regime.
    String*
    Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
                                        size_t offset, size_t size) {
        self->buffer   = (ByteBuf*)INCREF(buffer);
        self->content  = (char*)BB_Get_Buf(buffer) + offset;
        self->size     = size;
        self->hash_sum = -1;
        return self;
    }
This is similar to typical Java String implementations:
    public class String {
        private char[] value;
        private int offset;    // location in `value` where string starts
        private int count;     // length
        private int hash;
        ...
    }
There's extra memory cost to going that route, but it buys you some
flexibility.
The second question is whether to NUL-terminate UTF-8 Strings -- and as a
corollary, to guarantee that raw UTF-8 character data obtained from a String
will be NUL-terminated. This is hard. Can we guarantee that every host
string we wrap will be NUL-terminated? I know Perl tries hard to keep string
SVs NUL-terminated, but I don't imagine that every XS module everywhere
succeeds.
The alternative is not to NUL-terminate, but to cache a NUL-terminated
C-string representation on demand. As an optimization, we could check to see
whether the internal buffer is in fact NUL-terminated and use it if it
is.
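To make the on-demand idea concrete, here is a rough sketch. Everything in it
(the LazyStr struct, its `capacity` field, LazyStr_to_c_str) is hypothetical
and only for illustration; none of it is existing Clownfish API. Note that
peeking at the byte just past the content is only safe when we know that byte
is readable, so the sketch assumes we can track how many bytes are readable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an immutable String. */
typedef struct LazyStr {
    const char *content;   /* borrowed UTF-8 bytes, maybe not NUL-terminated */
    size_t      size;      /* number of content bytes */
    size_t      capacity;  /* readable bytes starting at content (>= size) */
    char       *c_str;     /* lazily built NUL-terminated copy, or NULL */
} LazyStr;

/* Return a NUL-terminated view of the string, or NULL if allocation fails.
 * The result stays valid until LazyStr_destroy() is called. */
const char*
LazyStr_to_c_str(LazyStr *self) {
    assert(self != NULL);
    /* Optimization: reuse the buffer if it already ends in NUL. */
    if (self->capacity > self->size && self->content[self->size] == '\0') {
        return self->content;
    }
    /* Otherwise build the copy once and cache it. */
    if (self->c_str == NULL) {
        char *copy = (char*)malloc(self->size + 1);
        if (copy == NULL) { return NULL; }
        if (self->size > 0) { memcpy(copy, self->content, self->size); }
        copy[self->size] = '\0';
        self->c_str = copy;
    }
    return self->c_str;
}

/* Release the cached copy; the borrowed content is left alone. */
void
LazyStr_destroy(LazyStr *self) {
    free(self->c_str);
    self->c_str = NULL;
}
```

The cache is the only mutable state, so the String still looks immutable from
the outside, at the cost of one extra pointer per object.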
> How should CharBufs and Strings interact?
IMO... CharBuf's primary use case should be to build Strings: after you've
manipulated the CharBuf to contain the desired character sequence, invoke
To_String() to create a new String.
It probably also makes sense to add a Yield_String() method to CharBuf which
spins off a String that steals the CharBuf's buffer, leaving the CharBuf empty.
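The difference between the two methods could look something like the sketch
below. DemoCharBuf, DemoString and the DemoCB_* functions are hypothetical
stand-ins with no refcounting or UTF-8 validation; they only show the copy
versus steal distinction, not how Clownfish would actually implement it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoCharBuf {
    char  *ptr;   /* owned, growable */
    size_t size;  /* bytes in use */
    size_t cap;   /* bytes allocated */
} DemoCharBuf;

typedef struct DemoString {
    char  *ptr;   /* owned; never modified after construction */
    size_t size;
} DemoString;

/* Append bytes, growing the buffer as needed. Returns 0 on success. */
int
DemoCB_cat_bytes(DemoCharBuf *self, const char *bytes, size_t len) {
    assert(self != NULL);
    if (len == 0) { return 0; }
    if (self->size + len > self->cap) {
        size_t new_cap = (self->size + len) * 2;
        char *new_ptr = (char*)realloc(self->ptr, new_cap);
        if (new_ptr == NULL) { return -1; }
        self->ptr = new_ptr;
        self->cap = new_cap;
    }
    memcpy(self->ptr + self->size, bytes, len);
    self->size += len;
    return 0;
}

/* To_String: copy the contents; the CharBuf is left untouched. */
DemoString*
DemoCB_to_string(const DemoCharBuf *self) {
    DemoString *str = (DemoString*)malloc(sizeof(DemoString));
    if (str == NULL) { return NULL; }
    str->ptr = (char*)malloc(self->size > 0 ? self->size : 1);
    if (str->ptr == NULL) { free(str); return NULL; }
    if (self->size > 0) { memcpy(str->ptr, self->ptr, self->size); }
    str->size = self->size;
    return str;
}

/* Yield_String: hand the buffer to a new String and reset the CharBuf. */
DemoString*
DemoCB_yield_string(DemoCharBuf *self) {
    DemoString *str = (DemoString*)malloc(sizeof(DemoString));
    if (str == NULL) { return NULL; }
    str->ptr  = self->ptr;   /* steal, no copy */
    str->size = self->size;
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
    return str;
}

void
DemoStr_destroy(DemoString *self) {
    if (self != NULL) { free(self->ptr); free(self); }
}
```

Yielding avoids the copy when the CharBuf is only a scratch builder; the
String may carry some unused capacity unless the buffer is shrunk first.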
> Isn't UTF-32 used in Python (among other encodings)?
Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.
> I found only three users of the "steal" constructors:
>
> * S_unescape_text in Lucy::Util::Json could be changed to use
> a CharBuf and Cat_Char
> * SkipStepper_to_string could simply use CB_newf, no?
> * DefDocReader_fetch_doc in the C bindings could create an
> extra copy or we could add something like InStream#ReadString.
You're right about SkipStepper. I'd suggest using a CharBuf with
Yield_String() for the DefDocReader and Json use cases.
Marvin Humphrey