Re: [lucy-dev] Iterating through CharBufs

Nick Wellnhofer Sun, 24 Feb 2013 07:19:20 -0800

On Feb 24, 2013, at 06:59 , Marvin Humphrey <[email protected]> wrote:


> I'd be interested whether you have anything to add to this argument from Tom
> Christiansen as to why iteration is the best model for string processing, as
> opposed to random access:
> 
>    http://bugs.python.org/issue12729#msg142036

I fully agree with Tom. Random access is only useful if you deal with 
fixed-length records, which are rarely used these days.

This is a very interesting thread, BTW. It taught me some things I didn't know 
about Unicode yet. Thanks for sharing it.

Another nice thing about iterators is that if we have to support multiple 
encodings, the encoding can be abstracted behind the iterator interface. So we 
can share the implementations of String methods across encodings except for 
performance-critical stuff like Hash_Sum.
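
To illustrate the point about hiding the encoding behind the iterator, here 
is a minimal sketch in C. It is not Clownfish code: StrIter, utf8_iter, 
utf16_iter, str_length and str_find_char are hypothetical names made up for 
this example, and both decoders assume well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator: `next` decodes one code point and advances,
 * returning -1 once the end of the string is reached. */
typedef struct StrIter StrIter;
struct StrIter {
    const void *cur;                 /* current position */
    const void *end;                 /* one past the last code unit */
    int32_t (*next)(StrIter *self);  /* encoding-specific decoder */
};

/* UTF-8 decoder. Assumes well-formed input (no validation). */
static int32_t utf8_next(StrIter *self) {
    const uint8_t *p   = self->cur;
    const uint8_t *end = self->end;
    if (p >= end) { return -1; }
    int32_t cp = *p++;
    int extra = 0;
    if (cp >= 0xF0)      { cp &= 0x07; extra = 3; }
    else if (cp >= 0xE0) { cp &= 0x0F; extra = 2; }
    else if (cp >= 0xC0) { cp &= 0x1F; extra = 1; }
    while (extra-- > 0 && p < end) { cp = (cp << 6) | (*p++ & 0x3F); }
    self->cur = p;
    return cp;
}

/* UTF-16 decoder. Assumes well-formed input (paired surrogates). */
static int32_t utf16_next(StrIter *self) {
    const uint16_t *p   = self->cur;
    const uint16_t *end = self->end;
    if (p >= end) { return -1; }
    int32_t cp = *p++;
    if (cp >= 0xD800 && cp <= 0xDBFF && p < end) {
        /* Combine the high surrogate with the low surrogate after it. */
        cp = 0x10000 + ((cp - 0xD800) << 10) + (*p++ - 0xDC00);
    }
    self->cur = p;
    return cp;
}

StrIter utf8_iter(const uint8_t *s, size_t units) {
    StrIter it = { s, s + units, utf8_next };
    return it;
}

StrIter utf16_iter(const uint16_t *s, size_t units) {
    StrIter it = { s, s + units, utf16_next };
    return it;
}

/* Shared across encodings: length in code points. */
size_t str_length(StrIter it) {
    assert(it.next != NULL);
    size_t count = 0;
    while (it.next(&it) >= 0) { count++; }
    return count;
}

/* Shared across encodings: code point index of `target`, or -1. */
long str_find_char(StrIter it, int32_t target) {
    assert(it.next != NULL);
    long index = 0;
    int32_t cp;
    while ((cp = it.next(&it)) >= 0) {
        if (cp == target) { return index; }
        index++;
    }
    return -1;
}
```

str_length and str_find_char never look at code units, so they work unchanged 
for either encoding. Something like Hash_Sum would probably still want an 
encoding-specific fast path instead of a function call per code point.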

> Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
> Clownfish doesn't have that option -- but if we make iteration our primary
> string processing model, we can avoid problems associated with random
> access, such as splitting logical characters.

UTF-8 is certainly superior in almost all respects. UTF-16 is still so widely 
used mainly for historical reasons. Many implementations originally started 
out with UCS-2 and later upgraded to UTF-16, which was the obvious but not 
really ideal choice. Switching from a fixed-width to a variable-width encoding 
has a lot of implications, some of which have been overlooked in some 
programming languages, as Tom Christiansen points out in the thread mentioned 
above.
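
One such implication, sketched in C: once surrogate pairs exist, the number 
of UTF-16 code units no longer equals the number of characters, and an index 
can land in the middle of a pair. The function names is_surrogate and 
utf16_code_points are made up for this example, and well-formed input is 
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if the UTF-16 code unit is one half of a surrogate pair. */
int is_surrogate(uint16_t unit) {
    return unit >= 0xD800 && unit <= 0xDFFF;
}

/* Number of code points in `units` code units of well-formed UTF-16. */
size_t utf16_code_points(const uint16_t *s, size_t units) {
    assert(s != NULL || units == 0);
    size_t count = 0;
    for (size_t i = 0; i < units; i++) {
        /* Count every unit except low (trailing) surrogates. */
        if (s[i] < 0xDC00 || s[i] > 0xDFFF) { count++; }
    }
    return count;
}
```

For the three code units 0x0061 0xD83D 0xDE00 ("a" followed by U+1F600), 
utf16_code_points returns 2, and indexing unit 1 yields a lone high surrogate 
rather than a character.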

Nick
