Re: [lucy-dev] Iterating through CharBufs

Marvin Humphrey Sat, 23 Feb 2013 21:59:52 -0800

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <[email protected]> wrote:
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.


I'd be interested to hear whether you have anything to add to this argument
from Tom Christiansen as to why iteration, rather than random access, is the
best model for string processing:

    http://bugs.python.org/issue12729#msg142036

    I may be wrong here, not least because I can think of possible extenuating
    circumstances, but it is my impression that there is an underlying
    assumption in the Python community and many others that being able to
    access the Nth character in a string in constant time for arbitrary N is
    the most important of all possible considerations.

    I don't believe that makes as much sense as people think, because I don't
    believe character strings really are accessed in that fashion very often
    at all.  Sure, if you have a 2D matrix of strings where a given row-column
    pair yields one character and you're taking the diagonal you might want
    that, but how often does that actually happen?  Virtually never: these are
    strings and not matrices we're running FFTs on after all.  We don't need
    to be able to load them into vector registers or anything the way the
    number-crunching people do.

    That's because strings are a sequence of characters: they're text.
    Whether reading text left to right, right to left, or even
    boustrophedonically, you're always going one past the character you're
    currently at.  You aren't going to the Nth character forward or back for
    arbitrary N.  That isn't how people deal with text.  Sometimes they do
    look at the end, or a few in from the far end, but even that can be
    handled in other fashions.

Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
Clownfish doesn't have that option -- but if we make iteration our primary
string processing model, we can avoid problems associated with random
access, such as splitting logical characters.
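
To make the point about splitting characters concrete, here is a minimal
sketch in C. It assumes a buffer of valid UTF-8, chosen purely for
illustration. The StrIter type and its functions are hypothetical names made
up for this example, not Clownfish's actual API. The iterator only ever stops
on a code point boundary, so a caller cannot land in the middle of a
multi-byte sequence the way a raw byte index can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical iterator over a buffer ASSUMED to hold valid UTF-8. */
typedef struct {
    const uint8_t *buf;
    size_t         len;   /* size in bytes */
    size_t         pos;   /* byte offset; always on a code point boundary */
} StrIter;

static StrIter
StrIter_new(const char *utf8, size_t len) {
    StrIter it = { (const uint8_t*)utf8, len, 0 };
    return it;
}

static int
StrIter_has_next(const StrIter *it) {
    return it->pos < it->len;
}

/* Decode the code point at the current position and step past it.
 * No validation is performed; malformed input is out of scope here. */
static uint32_t
StrIter_next(StrIter *it) {
    assert(it->pos < it->len);
    const uint8_t *p = it->buf + it->pos;
    uint32_t cp;
    size_t   n;
    if      (p[0] < 0x80) { cp = p[0];        n = 1; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; n = 2; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; n = 3; }
    else                  { cp = p[0] & 0x07; n = 4; }
    assert(it->pos + n <= it->len);
    for (size_t i = 1; i < n; i++) {
        cp = (cp << 6) | (p[i] & 0x3F);   /* fold in continuation bytes */
    }
    it->pos += n;
    return cp;
}

/* Count code points by iterating; the caller never sees a byte index. */
static size_t
count_code_points(const char *utf8) {
    StrIter it = StrIter_new(utf8, strlen(utf8));
    size_t count = 0;
    while (StrIter_has_next(&it)) {
        StrIter_next(&it);
        count++;
    }
    return count;
}
```

This only guards against splitting a single code point. If "logical
character" is taken to mean a grapheme cluster (a base character plus
combining marks), the same iterator interface could stay in place, but the
stepping logic would need segmentation rules on top of it.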

Marvin Humphrey
