[lucy-dev] Iterating through CharBufs

Nick Wellnhofer Thu, 21 Feb 2013 04:44:41 -0800

On 21/02/2013 04:33, Marvin Humphrey wrote:

> Iterating through strings seems orthogonal to mutability.  What is it that you
> find objectionable about the current iteration support?

AFAICS, the current way to efficiently iterate through a string is to create a ViewCharBuf and use the ViewCB_Nip_One method, resulting in the following code:


    ViewCB_Assign(iterator, string);
    while (ViewCB_Get_Size(iterator)) {
        uint32_t code_point = ViewCB_Code_Point_At(iterator, 0);
        // Do something with code_point
        ViewCB_Nip_One(iterator);
    }

ViewCharBuf seems to be an immutable reference to a substring of a CharBuf. ViewCB_Nip_One advances the start of the substring by a single code point.


I can see the following drawbacks with this approach:

1. It's not possible to safely iterate backwards with a ViewCharBuf because the ViewCharBuf doesn't know where the original string starts. Stepping backwards from a certain position in a string is rarely needed in practice, but the Highlighter is an example where exactly this operation is used.

2. It's hard to keep track of multiple positions in a string and extract a substring between two positions. These operations are primarily needed when splitting and tokenizing strings. You could remember a previous position in a string by simply copying a whole ViewCharBuf, but extracting the substring between the starts of two ViewCharBufs seems extremely messy to me.

IMO, the best way to solve these problems is to introduce string iterators. In a basic form, they are simply the collection of a byte offset and a code point offset into a string.


    struct CharBufIterator {
        CharBuf *cb;
        size_t   byte_offset;
        size_t   code_point_offset;
    };

Useful operations on a string iterator are:

    * Move the iterator forward or backward by a number of
      code points.
    * Get the code point at the current position or at an
      offset relative to the current position.
    * Get a substring between two string iterators.
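To make the three operations above concrete, here is a minimal sketch for UTF-8 only. It is not the Clownfish API: Utf8Str is a hypothetical stand-in for CharBuf, all StrIter_* names are made up for illustration, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a CharBuf: a borrowed range of valid UTF-8. */
typedef struct {
    const char *ptr;
    size_t      size;               /* size in bytes */
} Utf8Str;

/* Same shape as the CharBufIterator struct above (illustrative names). */
typedef struct {
    const Utf8Str *str;
    size_t         byte_offset;
    size_t         code_point_offset;
} StrIter;

static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Iterator positioned at the start of the string. */
static StrIter StrIter_top(const Utf8Str *str) {
    StrIter it = { str, 0, 0 };
    return it;
}

/* Move forward by up to `num` code points; returns how many were skipped. */
static size_t StrIter_advance(StrIter *it, size_t num) {
    size_t skipped = 0;
    while (skipped < num && it->byte_offset < it->str->size) {
        do { it->byte_offset++; }
        while (it->byte_offset < it->str->size
               && is_continuation((unsigned char)it->str->ptr[it->byte_offset]));
        skipped++;
    }
    it->code_point_offset += skipped;
    return skipped;
}

/* Move backward by up to `num` code points. This is safe because the
 * iterator knows where the original string starts (byte offset 0). */
static size_t StrIter_recede(StrIter *it, size_t num) {
    size_t skipped = 0;
    while (skipped < num && it->byte_offset > 0) {
        do { it->byte_offset--; }
        while (it->byte_offset > 0
               && is_continuation((unsigned char)it->str->ptr[it->byte_offset]));
        skipped++;
    }
    it->code_point_offset -= skipped;
    return skipped;
}

/* Decode the code point at the current position; UINT32_MAX at the end. */
static uint32_t StrIter_code_point(const StrIter *it) {
    if (it->byte_offset >= it->str->size) { return UINT32_MAX; }
    const unsigned char *p
        = (const unsigned char*)it->str->ptr + it->byte_offset;
    if (p[0] < 0x80) { return p[0]; }
    if (p[0] < 0xE0) { return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F); }
    if (p[0] < 0xF0) {
        return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    return ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
           | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

/* Substring between two iterators into the same string (no copy). */
static Utf8Str StrIter_substring(const StrIter *start, const StrIter *end) {
    assert(start->str == end->str);
    assert(start->byte_offset <= end->byte_offset);
    Utf8Str sub = { start->str->ptr + start->byte_offset,
                    end->byte_offset - start->byte_offset };
    return sub;
}
```

With something along these lines, a tokenizer could copy the iterator at the start of a token (a plain struct assignment), advance to the end of the token, and pass both iterators to StrIter_substring. Stepping backwards, as the Highlighter needs, is just StrIter_recede.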

I used a very basic form of string iterators in my implementation of the StandardTokenizer:


    http://s.apache.org/DCH

You can also make string iterators into full-blown classes. But typically they're short-lived and only used as local variables inside a single function.

> Bear in mind that one requirement for Clownfish strings going forward is to
> support UTF-16 as an internal encoding in addition to UTF-8.

Actually, you can implement every string operation in terms of string iterators. This concept is, for example, heavily used in the Parrot VM, which transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4 encodings. FWIW, I refactored large parts of Parrot's string subsystem in 2010 and I'd be happy to share my experiences.


Nick
