On 21/02/2013 04:33, Marvin Humphrey wrote:
> Iterating through strings seems orthogonal to mutability. What is it that you
> find objectionable about the current iteration support?
AFAICS, the current way to efficiently iterate through a string is to
create a ViewCharBuf and use the ViewCB_Nip_One method, resulting in the
following code:
    ViewCB_Assign(iterator, string);
    while (ViewCB_Get_Size(iterator)) {
        uint32_t code_point = ViewCB_Code_Point_At(iterator, 0);
        // Do something with code_point
        ViewCB_Nip_One(iterator);
    }
ViewCharBuf seems to be an immutable reference to a substring of a
CharBuf. ViewCB_Nip_One advances the start of the substring by a single
code point.
I can see the following drawbacks with this approach:
1. It's not possible to safely iterate backwards with a ViewCharBuf
because the ViewCharBuf doesn't know where the original string starts.
Stepping backwards from a certain position in a string is rarely needed
in practice but the Highlighter is an example where exactly this
operation is used.
2. It's hard to keep track of multiple positions in a string and extract
a substring between two positions. These operations are primarily needed
when splitting and tokenizing strings. You could remember a previous
position in a string by simply copying a whole ViewCharBuf but
extracting the substring between the start of two ViewCharBufs seems
extremely messy to me.
IMO, the best way to solve these problems is to introduce string
iterators. In a basic form, they are simply the collection of a byte
offset and a code point offset into a string.
    struct CharBufIterator {
        CharBuf *cb;
        size_t   byte_offset;
        size_t   code_point_offset;
    };
Useful operations on a string iterator are:
* Move the iterator forward or backward by a number of
code points.
* Get the code point at the current position or at an
offset relative to the current position.
* Get a substring between two string iterators.
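To make these operations concrete, here is a rough sketch over a raw UTF-8 buffer. Everything in it is hypothetical: StrIter and its functions are illustrative names, not existing Clownfish API, the iterator points at a plain char buffer rather than a CharBuf, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CharBufIterator over a raw UTF-8 buffer. */
typedef struct {
    const char *buf;               /* start of the UTF-8 data */
    size_t      size;              /* length in bytes */
    size_t      byte_offset;
    size_t      code_point_offset;
} StrIter;

static StrIter
StrIter_new(const char *buf, size_t size) {
    StrIter it = { buf, size, 0, 0 };
    return it;
}

/* Is the byte at `offset` a UTF-8 continuation byte (10xxxxxx)? */
static int
S_is_continuation(const StrIter *it, size_t offset) {
    return ((unsigned char)it->buf[offset] & 0xC0) == 0x80;
}

/* Decode the code point at the current position (must not be at the end). */
static uint32_t
StrIter_code_point(const StrIter *it) {
    const unsigned char *p = (const unsigned char*)it->buf + it->byte_offset;
    assert(it->byte_offset < it->size);
    if (p[0] < 0x80) { return p[0]; }
    if (p[0] < 0xE0) { return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F); }
    if (p[0] < 0xF0) {
        return ((uint32_t)(p[0] & 0x0F) << 12)
               | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
           | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

/* Move forward by up to n code points; return how many were skipped. */
static size_t
StrIter_advance(StrIter *it, size_t n) {
    size_t moved = 0;
    while (moved < n && it->byte_offset < it->size) {
        it->byte_offset++;  /* lead byte, then any continuation bytes */
        while (it->byte_offset < it->size
               && S_is_continuation(it, it->byte_offset)) {
            it->byte_offset++;
        }
        it->code_point_offset++;
        moved++;
    }
    return moved;
}

/* Move backward by up to n code points; stops safely at the string start. */
static size_t
StrIter_recede(StrIter *it, size_t n) {
    size_t moved = 0;
    while (moved < n && it->byte_offset > 0) {
        it->byte_offset--;
        while (it->byte_offset > 0
               && S_is_continuation(it, it->byte_offset)) {
            it->byte_offset--;
        }
        it->code_point_offset--;
        moved++;
    }
    return moved;
}

/* Copy the bytes between two iterators into a malloc'd C string. */
static char*
StrIter_substring(const StrIter *start, const StrIter *end) {
    size_t len;
    char  *copy;
    assert(start->buf == end->buf);
    assert(start->byte_offset <= end->byte_offset);
    len  = end->byte_offset - start->byte_offset;
    copy = (char*)malloc(len + 1);
    if (copy == NULL) { return NULL; }
    memcpy(copy, start->buf + start->byte_offset, len);
    copy[len] = '\0';
    return copy;
}
```

Because the iterator is a small value type, remembering a position is just a struct copy (StrIter mark = it;), and StrIter_recede can't run past the beginning because the iterator knows where the string starts.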
I used a very basic form of string iterators in my implementation of the
StandardTokenizer:
http://s.apache.org/DCH
You can also make string iterators into full-blown classes. But
typically they're short-lived and only used as local variables inside a
single function.
> Bear in mind that one requirement for Clownfish strings going forward is to
> support UTF-16 as an internal encoding in addition to UTF-8.
Actually, you can implement every string operation in terms of string
iterators. This concept is, for example, heavily used in the Parrot VM
which transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and
UCS-4 encodings. FWIW, I refactored large parts of Parrot's string
subsystem in 2010 and I'd be happy to share my experiences.
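As a rough illustration of the idea (not Parrot's or Clownfish's actual code), an iterator can carry an encoding-specific decode function, and generic operations are then written against the iterator only. EncIter, utf8_next, utf16_next and EncIter_find are made-up names, and the sketch assumes well-formed input:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding-agnostic iterator. */
typedef struct EncIter EncIter;
struct EncIter {
    const void *data;                 /* encoded string */
    size_t      size;                 /* length in code units */
    size_t      pos;                  /* current offset in code units */
    uint32_t  (*next)(EncIter *self); /* decode at pos, then advance */
};

static uint32_t
utf8_next(EncIter *self) {
    const unsigned char *s = (const unsigned char*)self->data;
    uint32_t cp;
    int      extra;
    assert(self->pos < self->size);
    cp    = s[self->pos++];
    extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
    if (extra > 0) { cp &= 0x3F >> extra; } /* payload bits of lead byte */
    while (extra-- > 0) {
        cp = (cp << 6) | (s[self->pos++] & 0x3F);
    }
    return cp;
}

static uint32_t
utf16_next(EncIter *self) {
    const uint16_t *s = (const uint16_t*)self->data;
    uint32_t cp;
    assert(self->pos < self->size);
    cp = s[self->pos++];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
        /* High surrogate: combine with the following low surrogate. */
        uint32_t low = s[self->pos++];
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
    }
    return cp;
}

static EncIter
EncIter_utf8(const char *s, size_t num_bytes) {
    EncIter it = { s, num_bytes, 0, utf8_next };
    return it;
}

static EncIter
EncIter_utf16(const uint16_t *s, size_t num_units) {
    EncIter it = { s, num_units, 0, utf16_next };
    return it;
}

/* Generic operation: index (in code points) of the first occurrence of
 * `target`, or -1. Knows nothing about the underlying encoding. */
static long
EncIter_find(EncIter it, uint32_t target) {
    long index = 0;
    while (it.pos < it.size) {
        if (it.next(&it) == target) { return index; }
        index++;
    }
    return -1;
}
```

EncIter_find works unchanged on both encodings; supporting another encoding would only mean adding another decode function.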
Nick