Re: [lucy-dev] Iterating through CharBufs

Marvin Humphrey Thu, 21 Feb 2013 21:39:11 -0800

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer wrote:
> IMO, the best way to solve these problems is to introduce string iterators.


+1

> In a basic form, they are simply the collection of a byte offset and a code
> point offset into a string.
>
>     struct CharBufIterator {
>         CharBuf *cb;
>         size_t   byte_offset;
>         size_t   code_point_offset;
>     };
>
> Useful operations on a string iterator are:
>
>     * Move the iterator forward or backward by a number of
>       code points.
>     * Get the code point at the current position or at an
>       offset relative to the current position.
>     * Get a substring between two string iterators.
>
> I used a very basic form of string iterators in my implementation of the
> StandardTokenizer:
>
>     http://s.apache.org/DCH
>
> You can also make string iterators into full-blown classes. But typically
> they're short-lived and only used as local variables inside a single
> function.

How about a three-pronged strategy which uses both of those approaches and
adds one more?

*   Externally, expose iterators as full-blown, opaque objects.
*   Internally, allocate iterators on the stack using alloca() and access the
    struct members directly.
*   To accommodate highly performance-sensitive client code, provide access to
    raw string data, so that the client can operate on it using its own highly
    optimized and customized routines.
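
To make the second prong concrete, here is a minimal sketch of an iterator
that lives on the stack and whose members are accessed directly.  It is not
Clownfish code: StrIter, StrIter_next and count_code_points are hypothetical
names, the iterator points at raw UTF-8 bytes rather than at a CharBuf, the
input is assumed to be valid UTF-8, and an ordinary automatic variable is
used in place of alloca().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the iterator struct quoted above.  It points at
 * raw UTF-8 bytes instead of a CharBuf so the sketch is self-contained. */
typedef struct {
    const char *ptr;               /* start of UTF-8 data (assumed valid) */
    size_t      size;              /* length in bytes */
    size_t      byte_offset;
    size_t      code_point_offset;
} StrIter;

/* Decode the code point at the current position and advance past it.
 * Returns 0 at the end of the string, 1 otherwise. */
int
StrIter_next(StrIter *it, uint32_t *code_point) {
    if (it->byte_offset >= it->size) { return 0; }
    const unsigned char *s = (const unsigned char*)it->ptr + it->byte_offset;
    size_t   len;
    uint32_t cp;
    /* The lead byte determines the sequence length and the payload bits. */
    if      (s[0] < 0x80) { len = 1; cp = s[0]; }
    else if (s[0] < 0xE0) { len = 2; cp = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; cp = s[0] & 0x0F; }
    else                  { len = 4; cp = s[0] & 0x07; }
    assert(it->byte_offset + len <= it->size);
    /* Each continuation byte contributes six more bits. */
    for (size_t i = 1; i < len; i++) {
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    it->byte_offset       += len;
    it->code_point_offset += 1;
    *code_point = cp;
    return 1;
}

/* Count code points using an iterator that lives on the stack. */
size_t
count_code_points(const char *utf8, size_t size) {
    StrIter  it = { utf8, size, 0, 0 };
    uint32_t cp;
    while (StrIter_next(&it, &cp)) { }
    return it.code_point_offset;
}
```

Because the struct is visible to internal code, a caller can read
it.byte_offset and it.code_point_offset directly after the loop, without
going through accessor methods.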

Here's some trivial sample code using CharBufIterator:

    CharBuf *hello = CB_newf("hello world");
    CharBufIterator *iterator = CB_Make_Iterator(hello);
    while (CBIter_Next(iterator)) {
        printf("%c | %d\n", CBIter_Get_Code_Point(iterator),
               (int)CBIter_Get_Byte_Offset(iterator));
    }

Returning to your earlier connection between iteration and mutability: the
fly in the ointment for exposing iteration as a public API on CharBuf is that
iterating over a mutable object is unsafe.  If the CharBuf is modified while
an iterator is live, the iterator's stored offsets may no longer fall on code
point boundaries, or may lie past the end of the data.  In contrast,
iterating over an immutable String would be safe.
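
The hazard can be illustrated with a small sketch.  MutBuf and MutBufIter are
hypothetical stand-ins for CharBuf and its iterator, not the real types, and
the fixed 64-byte storage is an assumption made only to keep the example
short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mutable buffer, standing in for CharBuf. */
typedef struct {
    char   data[64];
    size_t size;
} MutBuf;

/* Hypothetical iterator which remembers a byte offset into a MutBuf. */
typedef struct {
    const MutBuf *buf;
    size_t        byte_offset;
} MutBufIter;

/* Overwrite the buffer's contents with a NUL-terminated string. */
void
MutBuf_set(MutBuf *buf, const char *text) {
    size_t len = strlen(text);
    assert(len < sizeof(buf->data));
    memcpy(buf->data, text, len + 1);
    buf->size = len;
}

/* Returns 1 if the iterator's offset still lies within the buffer.  This
 * catches only the crudest kind of staleness: an offset can be in bounds
 * and still point into the middle of a multi-byte sequence. */
int
MutBufIter_in_bounds(const MutBufIter *it) {
    return it->byte_offset <= it->buf->size;
}
```

An iterator positioned at byte 8 of "hello world" passes the bounds check,
but after the buffer is reset to "hi" the same iterator refers to an offset
past the end of the data.  Nothing in the iterator itself records that the
buffer changed underneath it.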

This remains a problem if we make CharBuf a subclass of String.  If safe
iteration is part of String's interface, then a mutable subclass which cannot
support safe iteration violates the Liskov Substitution Principle[1].
Arguably, a String class and a mutable character buffer serve different
purposes.

A similar issue exists when using CharBufs as hash keys: it is possible to get
at a CharBuf key and mutate it, changing its hash sum, potentially colliding
with another key and generally wreaking havoc.  Python avoids such problems by
making its built-in mutable containers (list, dict, set) unhashable, so they
cannot serve as dict keys.
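
A short sketch of why a mutated key causes trouble.  The hash function here
is 32-bit FNV-1a, chosen only for illustration and not claimed to be what
Clownfish uses.  fnv1a_hash and bucket_for are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a, used here only as a stand-in hash function. */
uint32_t
fnv1a_hash(const char *data, size_t size) {
    uint32_t hash = 2166136261u;           /* FNV offset basis */
    for (size_t i = 0; i < size; i++) {
        hash ^= (unsigned char)data[i];
        hash *= 16777619u;                 /* FNV prime */
    }
    return hash;
}

/* Bucket a key would be filed under in a table with num_buckets slots. */
size_t
bucket_for(const char *data, size_t size, size_t num_buckets) {
    assert(num_buckets > 0);
    return fnv1a_hash(data, size) % num_buckets;
}
```

If a key is stored under bucket_for() of its original contents and its bytes
are later changed in place, the hash sum computed from the new contents
differs, so a lookup may probe a bucket other than the one where the entry
actually lives.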

>> Bear in mind that one requirement for Clownfish strings going forward is to
>> support UTF-16 as an internal encoding in addition to UTF-8.
>
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.

We're fortunate to be able to draw on your experience. :)

The Clownfish core, it seems to me, has goals which are the inverse of
Parrot's.  While Parrot's data types have to provide exhaustive support for
all features in all target languages, the primary objective for Clownfish is
to provide least-common-denominator data types which convert to and from host
data types sensibly and with minimal friction.  That leaves us with a lot of
freedom to pursue the secondary objective of giving the core Clownfish data
types a coherent, user-friendly programming API.

Marvin Humphrey

[1] http://en.wikipedia.org/wiki/Liskov_substitution_principle
