On Apr 22, 2013, at 22:01, Marvin Humphrey <[email protected]> wrote:
> On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer <[email protected]> wrote:
>> Maybe we should start to flesh out the design of immutable Strings. Do you
>> have a concrete plan already?
>
> To get started, we could simply duplicate CharBuf's implementation and strip
> out the mutability. :)
>
> The unusual requirements of Clownfish Strings are that they have to wrap host
> strings when bridging the host/C border. But it seems to me that this will
> always mean borrowing the host string's internal character array for use with
> a stack-allocated `const ZombieString*`.
>
> Looking forward...
>
> There are only so many ways to implement the "immutable String class" design
> pattern. :) See Python, Ruby Symbol, various implementations of Java
> String, C#, etc.
>
> The first question is how to handle the internal buffer. CharBufs need to own
> their buffers; immutable Strings do not.
>
> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them. One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.
>
>     String*
>     Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
>                                         size_t offset, size_t size) {
>         self->buffer   = (ByteBuf*)INCREF(buffer);
>         self->content  = (char*)BB_Get_Buf(buffer) + offset;
>         self->size     = size;
>         self->hash_sum = -1;
>         return self;
>     }
>
> This is similar to typical Java String implementations:
>
>     public class String {
>         private char[] value;
>         private int offset;  // location in `value` where string starts
>         private int count;   // length
>         private int hash;
>         ...
>     }
>
> There's extra memory cost to going that route, but it buys you some
> flexibility.
>
> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated. This is hard. Can we guarantee that every host
> string we wrap will be NUL-terminated? I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.
>
> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand. As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.
>
>> How should CharBufs and Strings interact?
>
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
>
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.
>
>> Isn't UTF-32 used in Python (among other encodings)?

Python is moving to a model where a string could be in any UTF width, based on
its characters:

http://www.python.org/dev/peps/pep-0393/

Andi..

>
> Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.
>
>> I found only three users of the "steal" constructors:
>>
>> * S_unescape_text in Lucy::Util::Json could be changed to use
>>   a CharBuf and Cat_Char
>> * SkipStepper_to_string could simply use CB_newf, no?
>> * DefDocReader_fetch_doc in the C bindings could create an
>>   extra copy or we could add something like InStream#ReadString.
>
> You're right about SkipStepper. I'd suggest using a CharBuf with
> Yield_String() for the DefDocReader and Json use cases.
>
> Marvin Humphrey
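
For reference, here is a rough, self-contained C sketch of two of the ideas
quoted above: several immutable strings sharing one refcounted buffer, and a
NUL-terminated C string that is only built (and cached) on demand. None of
these names are Clownfish API. SharedBuf, Str, Str_new_from_buf and
Str_to_c_str are all made up for illustration, and a real implementation
would plug into Clownfish's own refcounting rather than a hand-rolled counter.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted byte buffer shared by several strings. */
typedef struct SharedBuf {
    size_t refcount;
    size_t size;
    char  *bytes;
} SharedBuf;

/* Hypothetical immutable string: a (pointer, length) view into a SharedBuf. */
typedef struct Str {
    SharedBuf  *buffer;
    const char *content;
    size_t      size;
    char       *c_str;   /* cached NUL-terminated copy, NULL until needed */
} Str;

static SharedBuf *SharedBuf_new(const char *bytes, size_t size) {
    SharedBuf *self = malloc(sizeof(SharedBuf));
    if (!self) return NULL;
    self->bytes = malloc(size ? size : 1);
    if (!self->bytes) { free(self); return NULL; }
    if (size) memcpy(self->bytes, bytes, size);
    self->size     = size;
    self->refcount = 1;
    return self;
}

static void SharedBuf_decref(SharedBuf *self) {
    if (--self->refcount == 0) {
        free(self->bytes);
        free(self);
    }
}

/* Create a string that views `size` bytes of `buffer` starting at `offset`.
 * The string takes its own reference, so the buffer outlives the view. */
static Str *Str_new_from_buf(SharedBuf *buffer, size_t offset, size_t size) {
    if (offset > buffer->size || size > buffer->size - offset) return NULL;
    Str *self = malloc(sizeof(Str));
    if (!self) return NULL;
    buffer->refcount++;
    self->buffer  = buffer;
    self->content = buffer->bytes + offset;
    self->size    = size;
    self->c_str   = NULL;
    return self;
}

/* Return a NUL-terminated representation, building it only when needed. */
static const char *Str_to_c_str(Str *self) {
    const char *end     = self->content + self->size;
    const char *buf_end = self->buffer->bytes + self->buffer->size;
    /* Optimization: if the next byte is still inside the shared buffer and
     * is already NUL, the content can be handed out directly. */
    if (end < buf_end && *end == '\0') return self->content;
    if (!self->c_str) {
        self->c_str = malloc(self->size + 1);
        if (!self->c_str) return NULL;
        memcpy(self->c_str, self->content, self->size);
        self->c_str[self->size] = '\0';
    }
    return self->c_str;
}

static void Str_destroy(Str *self) {
    free(self->c_str);
    SharedBuf_decref(self->buffer);
    free(self);
}
```

Note that the "is it already NUL-terminated?" check is only safe here because
the sketch knows the extent of the shared buffer. For a wrapped host string we
would need the host to tell us that reading one byte past the end is allowed.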
