On Nov 23, 2010, at 9:44 PM, Stephen J. Turnbull wrote:

> James Y Knight writes:
> 
>> You put a smiley, but, in all seriousness, I think that's actually
>> the right thing to do if anyone writes a new programming
>> language. It is clearly the right thing if you don't have to be
>> concerned with backwards-compatibility: nobody really needs to be
>> able to access the Nth codepoint in a string in constant time, so
>> there's not really any point in storing a vector of codepoints.
> 
> A sad commentary on the state of Emacs usage, "nobody".
> 
> The theory is that accessing the first character of a region in a
> string often occurs as a primitive operation in O(N) or worse
> algorithms, sometimes without enough locality at the "collection of
> regions" level to give a reasonably small average access time.

I'm not sure what you mean by "the theory is".  Whose theory?  About what?

> In practice, any *Emacs user can tell you that yes, we do need to be
> able to access the Nth codepoint in a buffer in constant time.  The
> O(N) behavior of current Emacs implementations means that people often
> use a binary coding system on large files.  Yes, some position caching
> is done, but if you have a large file (eg, a mail file) which is
> virtually segmented using pointers to regions, locality gets lost.
> (This is not a design bug, this is a fundamental requirement: consider
> fast switching between threaded view and author-sorted view.)

Sounds like a design bug to me.  Personally, I'd implement "fast switching 
between threaded view and author-sorted view" the same way I'd address any 
other multiple-views-on-the-same-data problem.  I'd retain data structures for 
both, and update them as the underlying model changed.
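
For instance, a toy sketch of what I mean (every name here is invented; this
is not any real mailer's code): keep both view indexes next to the model and
update them whenever a message is added, so switching views is a lookup
rather than a rescan of the text.

import bisect
from dataclasses import dataclass, field

@dataclass
class Message:
    msg_id: str
    author: str
    thread_id: str

@dataclass
class MailModel:
    messages: dict = field(default_factory=dict)   # msg_id -> Message
    by_thread: dict = field(default_factory=dict)  # thread_id -> [msg_id, ...]
    by_author: list = field(default_factory=list)  # sorted [(author, msg_id)]

    def add(self, msg: Message) -> None:
        # Update the model and *both* view indexes in one place, so that
        # switching views later is a lookup, not a rescan of the text.
        self.messages[msg.msg_id] = msg
        self.by_thread.setdefault(msg.thread_id, []).append(msg.msg_id)
        bisect.insort(self.by_author, (msg.author, msg.msg_id))

model = MailModel()
model.add(Message("1", "alice", "t1"))
model.add(Message("2", "bob", "t1"))
model.add(Message("3", "alice", "t2"))
print(model.by_thread)  # threaded view
print(model.by_author)  # author-sorted view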

These representations may need to maintain cursors into the underlying 
character data, if they must retain giant wads of character data as an 
underlying representation (arguably the _main_ design bug in Emacs: it 
encourages you to do that for everything, rather than imposing a sensible 
structure). But those cursors don't need to be code-point counters; they 
could be byte offsets, or opaque handles whose precise meaning varies with 
the potentially variable underlying storage.
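
To make that concrete, a deliberately tiny sketch (TextStore is an invented
name, and a real implementation would also have to cope with mutation): the
store hands out byte offsets as opaque handles, and getting back to one is a
constant-time slice, with no codepoint counting anywhere.

class TextStore:
    def __init__(self, text: str):
        self._buf = text.encode("utf-8")

    def cursor_at_substring(self, needle: str) -> int:
        # The returned value is meaningful only to this store; callers
        # should treat it as an opaque handle, not a character count.
        return self._buf.find(needle.encode("utf-8"))

    def text_from(self, cursor: int, nbytes: int) -> str:
        # Jumping back to a remembered cursor is O(1): slice the bytes.
        return self._buf[cursor:cursor + nbytes].decode("utf-8",
                                                        errors="replace")

store = TextStore("naïve café")
cur = store.cursor_at_substring("café")
print(cur, store.text_from(cur, 5))   # a byte offset, then "café"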

Also, please remember that Emacs couldn't be implemented with giant Python 
strings anyway: crucially, all of this stuff is _mutable_ in Emacs.

> And of course an operation that sorts regions in a buffer using
> character pointers will have the same problem.  Working with memory
> pointers, OTOH, sucks more than that; GNU Emacs recently bit the
> bullet and got rid of their higher-level memory-oriented APIs, all of
> the Lisp structures now work with pointers, and only the very
> low-level structures know about character-to-memory pointer
> translation.
> 
> This performance issue is perceptible even on 3GHz machines with not
> so large (50MB) mbox files.  It's *horrid* if you do something like
> "occur" on a 1GB log file, then try randomly jumping to detected log
> entries.

Case in point: "occur" needs to scan the buffer anyway; you can't do better 
than linear time there.  So you're going to iterate through the buffer, using 
one of the techniques that James proposed, and remember some locations.  Why 
not just have those locations be opaque cursors into your data?
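
In toy form (the log format and the names are made up), that might look like
this: one unavoidable O(N) pass records the byte offsets of the matches, and
revisiting any of them later is just a slice.

import re

log = ("2010-11-23 ERROR disk full\n"
       "2010-11-23 INFO all fine\n"
       "2010-11-24 ERROR again\n").encode("utf-8")

# The unavoidable linear pass: find every line matching the pattern.
hits = [m.start() for m in re.finditer(rb"^.*ERROR.*$", log, re.MULTILINE)]

# "Jumping" back to a hit later is a byte slice, not a codepoint count.
for offset in hits:
    end = log.find(b"\n", offset)
    print(offset, log[offset:end].decode("utf-8"))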

In summary: you're right, in that James missed a spot.  You need bidirectional, 
*copyable* iterators that can traverse the string by byte, codepoint, grapheme, 
or decomposed glyph.
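
Very roughly, and only to show the shape of such an API (this is not a
proposal for Python's str, and real grapheme or decomposed-glyph stepping
needs the actual Unicode segmentation rules, so it's omitted): a copyable
cursor over a UTF-8 buffer that can step forwards or backwards by byte or by
codepoint.

class Utf8Cursor:
    def __init__(self, buf: bytes, pos: int = 0):
        self.buf = buf
        self.pos = pos          # always a byte offset

    def copy(self) -> "Utf8Cursor":
        # Copyable: callers can remember a position and come back to it.
        return Utf8Cursor(self.buf, self.pos)

    def next_byte(self) -> None:
        self.pos += 1

    def prev_byte(self) -> None:
        self.pos -= 1

    def next_codepoint(self) -> None:
        # Skip the lead byte, then any 0b10xxxxxx continuation bytes.
        self.pos += 1
        while self.pos < len(self.buf) and self.buf[self.pos] & 0xC0 == 0x80:
            self.pos += 1

    def prev_codepoint(self) -> None:
        # Back up over continuation bytes until we hit a lead byte.
        self.pos -= 1
        while self.pos > 0 and self.buf[self.pos] & 0xC0 == 0x80:
            self.pos -= 1

buf = "naïve".encode("utf-8")
c = Utf8Cursor(buf)
c.next_codepoint()          # past 'n'
c.next_codepoint()          # past 'a'
mark = c.copy()             # remember the position of 'ï'
c.next_codepoint()          # past 'ï' (two bytes, one step)
print(c.pos - mark.pos)     # 2: the codepoint step crossed two bytes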
