On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <[email protected]> wrote:
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.
I'd be interested to hear whether you have anything to add to this argument from Tom
Christiansen as to why iteration is the best model for string processing, as
opposed to random access:
http://bugs.python.org/issue12729#msg142036
I may be wrong here, not least because I can think of possible extenuating
circumstances, but it is my impression that there is an underlying
assumption in the Python community and many others that being able to
access the Nth character in a string in constant time for arbitrary N is
the most important of all possible considerations.
I don't believe that makes as much sense as people think, because I don't
believe character strings really are accessed in that fashion very often
at all. Sure, if you have a 2D matrix of strings where a given row-column
pair yields one character and you're taking the diagonal you might want
that, but how often does that actually happen? Virtually never: these are
strings and not matrices we're running FFTs on after all. We don't need
to be able to load them into vector registers or anything the way the
number-crunching people do.
That's because strings are a sequence of characters: they're text.
Whether reading text left to right, right to left, or even
boustrophedonically, you're always going one past the character you're
currently at. You aren't going to the Nth character forward or back for
arbitrary N. That isn't how people deal with text. Sometimes they do
look at the end, or a few in from the far end, but even that can be
handled in other fashions.
Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
Clownfish doesn't have that option -- but if we make iteration our primary
string processing model, we can avoid problems associated with random
access, such as splitting logical characters.
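To make the iteration model concrete, here is a minimal sketch in C of a
forward iterator over UTF-8. It is only an illustration: the names StrIter,
stri_new, stri_next and count_code_points are hypothetical and are not the
Clownfish or Parrot API, and the code assumes well-formed UTF-8 input with no
validation. The point is that the iterator only ever stops on a code point
boundary, so callers cannot land in the middle of a multi-byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator over a UTF-8 buffer (illustration only, not the
 * Clownfish API). Assumes the buffer holds well-formed UTF-8. */
typedef struct {
    const unsigned char *ptr;  /* current position, on a code point boundary */
    const unsigned char *end;  /* one past the last byte of the buffer */
} StrIter;

static StrIter
stri_new(const char *utf8, size_t size) {
    StrIter it;
    it.ptr = (const unsigned char*)utf8;
    it.end = it.ptr + size;
    return it;
}

/* Return the next code point and advance past it, or -1 at the end.
 * No validation is performed; malformed input gives unspecified values. */
static int32_t
stri_next(StrIter *it) {
    if (it->ptr >= it->end) { return -1; }

    unsigned char lead = *it->ptr++;
    int32_t cp;
    int     trailing;  /* number of continuation bytes that follow */

    if (lead < 0x80)      { cp = lead;        trailing = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
    else                  { cp = lead & 0x07; trailing = 3; }

    /* Fold in the low six bits of each continuation byte. */
    while (trailing-- > 0 && it->ptr < it->end) {
        cp = (cp << 6) | (*it->ptr++ & 0x3F);
    }
    return cp;
}

/* Count code points with a single left-to-right pass. */
static size_t
count_code_points(const char *utf8, size_t size) {
    StrIter it = stri_new(utf8, size);
    size_t  n  = 0;
    while (stri_next(&it) != -1) { n++; }
    return n;
}
```

Operations such as searching, trimming, or taking a substring can be written
on top of stri_next in the same way, with positions carried around as
iterators rather than integer offsets. Grouping code points into larger
logical characters (combining sequences, for instance) would be a further
layer on top of this and is not attempted here.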
Marvin Humphrey