Re: [lucy-dev] CharBuf functions taking char* arguments

Marvin Humphrey Mon, 22 Apr 2013 22:07:54 -0700

On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer <[email protected]> wrote:
> Maybe we should start to flesh out the design of immutable Strings.  Do you
> have a concrete plan already?


To get started, we could simply duplicate CharBuf's implementation and strip
out the mutability. :)

The unusual requirement of Clownfish Strings is that they have to wrap host
strings when bridging the host/C border.  But it seems to me that this will
always mean borrowing the host string's internal character array for use with
a stack-allocated `const ZombieString*`.

Looking forward...

There are only so many ways to implement the "immutable String class" design
pattern. :)  See Python, Ruby Symbol, various implementations of Java
String, C#, etc.

The first question is how to handle the internal buffer.  CharBufs need to own
their buffers; immutable Strings do not.

Multiple immutable String objects can share a single contiguous buffer, though
the buffer must outlive all of them.  One possible implementation is to wrap
the buffer in an object which has a refcount or otherwise fits into the GC
regime.

    String*
    Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
                                        size_t offset, size_t size) {
        self->buffer   = (ByteBuf*)INCREF(buffer);
        self->content  = (char*)BB_Get_Buf(buffer) + offset;
        self->size     = size;
        self->hash_sum = -1;
        return self;
    }

This is similar to typical Java String implementations:

    public class String {
        private char[] value;
        private int offset;  // location in `value` where string starts
        private int count;   // length
        private int hash;
        ...
    }

There's extra memory cost to going that route, but it buys you some
flexibility.
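
One kind of flexibility a shared, refcounted buffer allows is taking a
substring without copying any character data.  What follows is only a minimal
sketch under stated assumptions: `SharedBuf` and `SharedStr` are hypothetical
stand-ins invented for illustration, not the real ByteBuf or String structs,
and a plain counter stands in for INCREF/DECREF.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted byte buffer; stands in for ByteBuf. */
typedef struct {
    unsigned refcount;
    size_t   size;
    char    *bytes;
} SharedBuf;

/* Hypothetical immutable string: a window onto a SharedBuf. */
typedef struct {
    SharedBuf  *buffer;   /* keeps the bytes alive */
    const char *content;  /* start of this string inside the buffer */
    size_t      size;     /* length in bytes */
} SharedStr;

/* Copy `size` bytes into a new buffer with refcount 1; NULL on OOM. */
static SharedBuf*
SharedBuf_new(const char *bytes, size_t size) {
    assert(bytes != NULL);
    SharedBuf *self = (SharedBuf*)malloc(sizeof(SharedBuf));
    if (self == NULL) { return NULL; }
    self->bytes = (char*)malloc(size > 0 ? size : 1);
    if (self->bytes == NULL) { free(self); return NULL; }
    memcpy(self->bytes, bytes, size);
    self->size     = size;
    self->refcount = 1;
    return self;
}

/* Drop one reference; free the buffer when the last one goes away. */
static void
SharedBuf_decref(SharedBuf *self) {
    assert(self->refcount > 0);
    if (--self->refcount == 0) {
        free(self->bytes);
        free(self);
    }
}

/* Point `self` at a slice of `buffer`, taking a new reference. */
static void
SharedStr_init(SharedStr *self, SharedBuf *buffer, size_t offset,
               size_t size) {
    assert(offset <= buffer->size && size <= buffer->size - offset);
    buffer->refcount++;
    self->buffer  = buffer;
    self->content = buffer->bytes + offset;
    self->size    = size;
}

/* Substring that shares the source's buffer instead of copying. */
static void
SharedStr_substring(SharedStr *dest, const SharedStr *src, size_t offset,
                    size_t size) {
    assert(offset <= src->size && size <= src->size - offset);
    size_t base = (size_t)(src->content - src->buffer->bytes);
    SharedStr_init(dest, src->buffer, base + offset, size);
}

/* Release this string's reference to the buffer. */
static void
SharedStr_destroy(SharedStr *self) {
    SharedBuf_decref(self->buffer);
    self->buffer  = NULL;
    self->content = NULL;
    self->size    = 0;
}
```

The memory cost shows up as the extra pointer per string, and in the fact that
a short substring keeps the whole parent buffer alive until it is destroyed.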

The second question is whether to NUL-terminate UTF-8 Strings -- and as a
corollary, to guarantee that raw UTF-8 character data obtained from a String
will be NUL-terminated.  This is hard.  Can we guarantee that every host
string we wrap will be NUL-terminated?  I know Perl tries hard to keep string
SVs NUL-terminated, but I don't imagine that every XS module everywhere
succeeds.

The alternative is not to NUL-terminate, but to cache a NUL-terminated
C-string representation on demand.  As an optimization, we could check to see
whether the internal buffer is in fact NUL-terminated and use it if it
is.
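
The cache-on-demand alternative could look roughly like the sketch below.
`LazyStr` and its functions are hypothetical names made up for illustration,
not Clownfish API.  The sketch leaves out the "already NUL-terminated"
optimization, because reading `content[size]` is only safe when the buffer's
allocated extent is known to include that byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical immutable string wrapping borrowed, unterminated bytes. */
typedef struct {
    const char *content;  /* borrowed UTF-8, not necessarily NUL-terminated */
    size_t      size;     /* length in bytes */
    char       *c_str;    /* lazily built NUL-terminated copy, or NULL */
} LazyStr;

static void
LazyStr_init(LazyStr *self, const char *content, size_t size) {
    assert(content != NULL);
    self->content = content;
    self->size    = size;
    self->c_str   = NULL;
}

/* Return a NUL-terminated copy, building it on first use and reusing it
 * afterwards.  Returns NULL if allocation fails. */
static const char*
LazyStr_to_c_str(LazyStr *self) {
    if (self->c_str == NULL) {
        char *copy = (char*)malloc(self->size + 1);
        if (copy == NULL) { return NULL; }
        memcpy(copy, self->content, self->size);
        copy[self->size] = '\0';
        self->c_str = copy;
    }
    return self->c_str;
}

/* Free the cached copy; the borrowed bytes are not ours to free. */
static void
LazyStr_destroy(LazyStr *self) {
    free(self->c_str);
    self->c_str = NULL;
}
```

Strings that are never handed to a C API expecting NUL termination never pay
for the copy; the cost is one extra pointer per String.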

> How should CharBufs and Strings interact?

IMO... CharBuf's primary use case should be to build Strings: after you've
manipulated the CharBuf to contain the desired character sequence, invoke
To_String() to create a new String.

It probably also makes sense to add a Yield_String() method to CharBuf which
spins off a String which steals the CharBuf's buffer and resets it to empty.
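
A rough sketch of the buffer-stealing idea, assuming simplified structs:
`DemoCharBuf`, `DemoString`, and the function names are hypothetical and only
illustrate the ownership transfer, not the eventual Clownfish implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mutable builder that owns a growable buffer. */
typedef struct {
    char  *ptr;
    size_t size;
    size_t cap;
} DemoCharBuf;

/* Hypothetical immutable string that owns its bytes. */
typedef struct {
    char  *ptr;
    size_t size;
} DemoString;

static void
DemoCB_init(DemoCharBuf *self) {
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
}

/* Append `len` bytes, growing as needed.  Returns 1 on success, 0 on OOM. */
static int
DemoCB_cat(DemoCharBuf *self, const char *bytes, size_t len) {
    assert(bytes != NULL);
    if (len == 0) { return 1; }
    if (self->size + len > self->cap) {
        size_t new_cap = (self->size + len) * 2;
        char *grown = (char*)realloc(self->ptr, new_cap);
        if (grown == NULL) { return 0; }
        self->ptr = grown;
        self->cap = new_cap;
    }
    memcpy(self->ptr + self->size, bytes, len);
    self->size += len;
    return 1;
}

/* Hand the buffer over to a new string without copying, then reset the
 * builder to empty so it can be reused. */
static DemoString
DemoCB_yield_string(DemoCharBuf *self) {
    DemoString str;
    str.ptr  = self->ptr;
    str.size = self->size;
    DemoCB_init(self);
    return str;
}

static void
DemoString_destroy(DemoString *self) {
    free(self->ptr);
    self->ptr  = NULL;
    self->size = 0;
}
```

A To_String() equivalent would instead allocate and copy, leaving the builder's
contents intact; yielding avoids that copy when the builder is about to be
discarded or refilled anyway.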

> Isn't UTF-32 used in Python (among other encodings)?

Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.

> I found only three users of the "steal" constructors:
>
>     * S_unescape_text in Lucy::Util::Json could be changed to use
>       a CharBuf and Cat_Char
>     * SkipStepper_to_string could simply use CB_newf, no?
>     * DefDocReader_fetch_doc in the C bindings could create an
>       extra copy or we could add something like InStream#ReadString.

You're right about SkipStepper.  I'd suggest using a CharBuf with
Yield_String() for the DefDocReader and Json use cases.

Marvin Humphrey
