On 21/02/2013 04:33, Marvin Humphrey wrote:
> Iterating through strings seems orthogonal to mutability.  What is it that you
> find objectionable about the current iteration support?

AFAICS, the current way to efficiently iterate through a string is to create a ViewCharBuf and use the ViewCB_Nip_One method, resulting in the following code:

    /* Point the view at the whole string. */
    ViewCB_Assign(iterator, string);
    while (ViewCB_Get_Size(iterator)) {
        uint32_t code_point = ViewCB_Code_Point_At(iterator, 0);
        /* Do something with code_point */
        /* Drop the first code point from the view. */
        ViewCB_Nip_One(iterator);
    }

ViewCharBuf seems to be an immutable reference to a substring of a CharBuf. ViewCB_Nip_One advances the start of the substring by a single code point.

I can see the following drawbacks with this approach:

1. It's not possible to safely iterate backwards with a ViewCharBuf because the ViewCharBuf doesn't know where the original string starts. Stepping backwards from a given position in a string is rarely needed in practice, but the Highlighter is one example where exactly this operation is used.

2. It's hard to keep track of multiple positions in a string and to extract the substring between two positions. These operations are primarily needed when splitting and tokenizing strings. You could remember a previous position by simply copying a whole ViewCharBuf, but extracting the substring between the starts of two ViewCharBufs seems extremely messy to me.

IMO, the best way to solve these problems is to introduce string iterators. In their most basic form, they simply pair a byte offset with a code point offset into a string.

    struct CharBufIterator {
        CharBuf *cb;
        size_t   byte_offset;
        size_t   code_point_offset;
    };

Useful operations on a string iterator are:

    * Move the iterator forward or backward by a number of
      code points.
    * Get the code point at the current position or at an
      offset relative to the current position.
    * Get a substring between two string iterators.
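
To make these operations concrete, here is a minimal sketch of such an iterator over a plain UTF-8 buffer. All names (StrIter, StrIter_advance, StrIter_recede, StrIter_code_point, StrIter_substring) are made up for illustration and are not existing Clownfish API. The sketch assumes the buffer holds valid UTF-8 and uses a raw buffer plus size where the real thing would hold a CharBuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical iterator over a valid UTF-8 buffer. */
typedef struct {
    const char *buf;            /* start of the string data */
    size_t      size;           /* string length in bytes */
    size_t      byte_offset;
    size_t      code_point_offset;
} StrIter;

/* Create an iterator positioned at the start of the string. */
StrIter StrIter_top(const char *buf, size_t size) {
    StrIter it = { buf, size, 0, 0 };
    return it;
}

/* True for UTF-8 continuation bytes (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

/* Move forward by up to `num` code points; return the number skipped. */
size_t StrIter_advance(StrIter *it, size_t num) {
    size_t skipped = 0;
    while (skipped < num && it->byte_offset < it->size) {
        do {
            it->byte_offset++;
        } while (it->byte_offset < it->size
                 && is_continuation((unsigned char)it->buf[it->byte_offset]));
        skipped++;
    }
    it->code_point_offset += skipped;
    return skipped;
}

/* Move backward by up to `num` code points. This is safe because the
 * iterator knows where the string starts. */
size_t StrIter_recede(StrIter *it, size_t num) {
    size_t skipped = 0;
    while (skipped < num && it->byte_offset > 0) {
        do {
            it->byte_offset--;
        } while (it->byte_offset > 0
                 && is_continuation((unsigned char)it->buf[it->byte_offset]));
        skipped++;
    }
    it->code_point_offset -= skipped;
    return skipped;
}

/* Decode the code point at the current position; return -1 at the end. */
int32_t StrIter_code_point(const StrIter *it) {
    if (it->byte_offset >= it->size) { return -1; }
    const unsigned char *p = (const unsigned char*)it->buf + it->byte_offset;
    size_t  remaining = it->size - it->byte_offset;
    size_t  len;
    int32_t cp;
    if (*p < 0x80)      { return *p; }
    else if (*p < 0xE0) { cp = *p & 0x1F; len = 2; }
    else if (*p < 0xF0) { cp = *p & 0x0F; len = 3; }
    else                { cp = *p & 0x07; len = 4; }
    if (len > remaining) { return -1; }   /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Copy the bytes between two iterators over the same string into a new
 * NUL-terminated buffer. The caller frees the result. */
char *StrIter_substring(const StrIter *start, const StrIter *end) {
    assert(start->buf == end->buf);
    assert(start->byte_offset <= end->byte_offset);
    size_t len  = end->byte_offset - start->byte_offset;
    char  *copy = malloc(len + 1);
    if (copy == NULL) { return NULL; }
    memcpy(copy, start->buf + start->byte_offset, len);
    copy[len] = '\0';
    return copy;
}
```

Remembering a position is then just a struct copy (`StrIter start = it;`), and a token is the substring between the saved iterator and the current one.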

I used a very basic form of string iterators in my implementation of the StandardTokenizer:

    http://s.apache.org/DCH

You could also make string iterators full-blown classes, but typically they're short-lived and only used as local variables inside a single function.

> Bear in mind that one requirement for Clownfish strings going forward is to
> support UTF-16 as an internal encoding in addition to UTF-8.

Actually, you can implement every string operation in terms of string iterators. This concept is, for example, heavily used in the Parrot VM which transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4 encodings. FWIW, I refactored large parts of Parrot's string subsystem in 2010 and I'd be happy to share my experiences.
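
As a rough illustration of that idea, the sketch below hides the encoding behind a per-encoding "next code point" function, so generic operations only ever see code points. It is not taken from Parrot or Clownfish: CPIter, utf8_next, utf16_next, Str_equals and Str_find_code_point are hypothetical names, and the decoders assume valid input (UTF-16 in native byte order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding-independent iterator. */
typedef struct CPIter CPIter;
struct CPIter {
    const void *data;   /* encoded string data */
    size_t      size;   /* length in code units */
    size_t      pos;    /* current offset in code units */
    /* Decode the code point at `pos` and advance; return -1 at the end. */
    int32_t   (*next)(CPIter *self);
};

/* Decoder for valid UTF-8 (code units are bytes). */
static int32_t utf8_next(CPIter *self) {
    const uint8_t *s = self->data;
    if (self->pos >= self->size) { return -1; }
    uint8_t lead = s[self->pos++];
    int32_t cp;
    int     extra;
    if (lead < 0x80)      { cp = lead;        extra = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
    else                  { cp = lead & 0x07; extra = 3; }
    while (extra-- > 0 && self->pos < self->size) {
        cp = (cp << 6) | (s[self->pos++] & 0x3F);
    }
    return cp;
}

/* Decoder for valid UTF-16 (code units are uint16_t). */
static int32_t utf16_next(CPIter *self) {
    const uint16_t *s = self->data;
    if (self->pos >= self->size) { return -1; }
    int32_t cp = s[self->pos++];
    if (cp >= 0xD800 && cp <= 0xDBFF && self->pos < self->size) {
        /* Combine a high surrogate with the low surrogate that follows. */
        int32_t low = s[self->pos++];
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
    }
    return cp;
}

CPIter CPIter_utf8(const char *bytes, size_t size) {
    CPIter it = { bytes, size, 0, utf8_next };
    return it;
}

CPIter CPIter_utf16(const uint16_t *units, size_t size) {
    CPIter it = { units, size, 0, utf16_next };
    return it;
}

/* Compare two strings code point by code point, whatever their encodings. */
int Str_equals(CPIter a, CPIter b) {
    for (;;) {
        int32_t x = a.next(&a);
        int32_t y = b.next(&b);
        if (x != y) { return 0; }
        if (x < 0)  { return 1; }   /* both reached the end together */
    }
}

/* Return the code point index of the first match, or -1 if not found. */
long Str_find_code_point(CPIter it, int32_t wanted) {
    long    index = 0;
    int32_t cp;
    while ((cp = it.next(&it)) >= 0) {
        if (cp == wanted) { return index; }
        index++;
    }
    return -1;
}
```

Neither Str_equals nor Str_find_code_point knows which encoding it is reading, so supporting another internal encoding only means adding another decoder function.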

Nick
