Re: [lucy-dev] Initial sketch of string iterators

Nick Wellnhofer Sat, 13 Apr 2013 08:15:50 -0700

On Apr 13, 2013, at 04:01 , Marvin Humphrey <[email protected]> wrote:


> *   The next patch, "Convert CB_Trim_{Top|Tail} to use iterators", looks
>    perfect as well.  No need to implement in concrete subclasses, I see. :)

Yes, that's the idea. If the switch to iterators is complete, adding a new 
string encoding should be really easy:

    * Implement the iterator class for the encoding.
    * Derive a new string class from CharBuf which only has to implement
      the iterator factory methods and Cat_Char.

> *   Should CharBufIterator#Next return 0 when it reaches the end of a string
>    instead of throwing an exception?  That would make it possible to iterate
>    through text values which are not expected to contain NUL without having
>    to call Has_Next() before each call to Next().

This seems like a worthwhile optimization. Here's how both versions would look 
like:

    CharBufIterator *iter = CB_Top(string);
    while (CBIter_Has_Next(iter)) {
        uint32_t code_point = CBIter_Next(iter);
        ...
    }

    CharBufIterator *iter = CB_Top(string);
    uint32_t code_point;
    while (0 != (code_point = CBIter_Next(iter))) {
        ...
    }

I find the Has_Next() version more readable but the second variant isn't too 
bad. The only thing I don't like is the choice of 0 as sentinel value. I'd 
prefer -1 or any positive value beyond the Unicode planes.

After reaching the end of the string, I can see two options which both have 
their pros and cons:

    * Change to a new iterator state ("beyond string boundary") and throw
      an exception on every subsequent access (like in your example below).
    * Keep returning the sentinel value on calls to Next() but allow to
      move backward via Prev(). This can result in an infinite loop in
      faulty code.

> *   I can imagine some possibilities for optimizing the internals, but those
>    are implementation details (which I imagine you left out on purpose).

Yes, the Prev() and Next() methods should decode characters directly without 
calling StrHelp_decode_utf8_char.

> It seems to me that the factory methods Top() and Tail() should be public,
> while ZTop() and ZTail() should have parcel exposure.  More generally, I've
> moved towards the position that while we want to have the core allocate
> objects on the stack for certain internal tasks, it would be unwise at this
> point for Clownfish to support stack object allocation as a public API.
> I'm actually hoping that as we overhaul Clownfish string handling that
> ZombieCharBuf can go away.  But what you've presented makes sense and fits
> perfectly within the current context.

I think the string iterators are good candidates for "zombie" objects. They're 
small and typically only used within a single function. But +1 for not exposing 
ZombieIterators publicly.

> How about something like this, which I think should have the same cost?
> 
> -    if (self->byte_offset >= char_buf->size) {
> -        THROW(ERR, "Iteration past end of string");
> -    }
> +    if (self->byte_offset > char_buf->size) {
> +        THROW(ERR, "Iteration past end of string");
> +    }
> +    else if (self->byte_offset == char_buf->size) {

Something like ++self->byte_offset would be needed here.

> +        return 0;
> +    }

See discussion above.

>> +ZombieUTF8Iterator*
>> +ZUTF8Iter_new(void *allocation, CharBuf *char_buf, size_t byte_offset) {
>> +    ZombieUTF8Iterator *self = (ZombieUTF8Iterator*)allocation;
>> +    self->ref.count   = 1;
> 
> Might want to go through VTable_Init_Obj(), no?  ;)

Sure ;)

Nick

Re: [lucy-dev] Initial sketch of string iterators

Reply via email to