Re: [Python-Dev] len(chr(i)) = 2?

Glyph Lefkowitz Thu, 25 Nov 2010 23:53:35 -0800

On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote:

> Greg Ewing writes:
>> On 24/11/10 22:03, Stephen J. Turnbull wrote:
>>> But
>>> if you actually need to remember positions, or regions, to jump to
>>> later or to communicate to other code that manipulates them, doing
>>> this stuff the straightforward way (just copying the whole iterator
>>> object to hang on to its state) becomes expensive.
>> 
>> If the internal representation of a text pointer (I won't call it
>> an iterator because that means something else in Python) is a byte
>> offset or something similar, it shouldn't take up any more space
>> than a Python int, which is what you'd be using anyway if you
>> represented text positions by grapheme indexes or whatever.
> 
> That's not necessarily true.  Eg, in Emacs ("there you go again"),
> Lisp integers are not only immediate (saving one pointer), but the
> type is encoded in the lower bits, so that there is no need for a type
> pointer -- the representation is smaller than the opaque marker type.
> Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of
> 24 bytes on a 64-bit platform.


Yes, yes, Lisp is very clever.  Maybe some other runtime, like PyPy, could make 
this optimization.  But I don't think that anyone is filling up main memory 
with gigantic piles of character indexes and needs to squeeze out that extra 
couple of bytes of memory on such a tiny object.  Plus, operating directly on 
the UTF-8 bytes would let such a user stop copying the character data itself 
just to decode it, and on mostly-ASCII UTF-8 text (a common use-case) this is 
a 2x savings right off the bat.
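
As a rough check of the 2x figure, the sketch below compares the UTF-8 size of 
a mostly-ASCII string with its size at two bytes per code point.  The sample 
text is made up for illustration, and "two bytes per code point" is an 
assumption about the decoded form (what a UCS-2/UTF-16 representation would 
use); a four-byte representation would widen the gap.

```python
# Illustrative sample only; any mostly-ASCII text behaves similarly.
text = "def greet(name): return 'hello, ' + name  # caf\u00e9\n" * 1000

# Size of the text as stored in UTF-8 (ASCII is 1 byte, e-acute is 2).
utf8_size = len(text.encode("utf-8"))

# Size at 2 bytes per code point; "utf-16-le" avoids the 2-byte BOM.
utf16_size = len(text.encode("utf-16-le"))

ratio = utf16_size / utf8_size
print(f"UTF-8: {utf8_size} bytes, 2-byte form: {utf16_size} bytes, "
      f"ratio: {ratio:.2f}")
```

The ratio approaches 2 as the share of non-ASCII characters shrinks.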

> In Python it's true that markers can use the same data structure as
> integers and simply provide different methods, and it's arguable that
> Python's design is better.  But if you use bytes internally, then you
> have problems.

No, you just have design questions.

> Do you expose that byte value to the user?

Yes, but only if they ask for it.  It's useful for computing things like 
quotas.

> Can users (programmers using the language and end users) specify positions in 
> terms of byte values?

Sure, why not?

> If so, what do you do if the user specifies a byte value that points into a 
> multibyte character?

Go to the beginning of the multibyte character.  Report that position; if the 
user then asks the resulting marker object for its position, it will report 
that byte offset, not the originally requested one.  (Obviously, do the same 
thing for surrogate pair code points.)
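
A minimal sketch of "go to the beginning of the multibyte character" for UTF-8 
data follows.  The function name snap_to_char_start is hypothetical, and the 
code assumes the input is well-formed UTF-8; it relies only on the fact that 
UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def snap_to_char_start(data: bytes, offset: int) -> int:
    """Return the largest index <= offset that begins a UTF-8 sequence.

    Assumes `data` is well-formed UTF-8 and 0 <= offset <= len(data).
    """
    if not 0 <= offset <= len(data):
        raise IndexError("offset out of range")
    # Continuation bytes look like 0b10xxxxxx; step back past them.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


data = "a\u20acb".encode("utf-8")  # 'a', EURO SIGN (3 bytes), 'b'

# Offsets 2 and 3 point into the middle of the euro sign; both snap to 1.
snapped = [snap_to_char_start(data, i) for i in range(len(data) + 1)]
print(snapped)
```

A marker object built on this would store the snapped offset, so asking it for 
its position returns the adjusted value rather than the requested one.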

> What if the user wants to specify position by number of characters?

Part of the point that we are trying to make here is that nobody really cares 
about that use-case.  In order to know anything useful about a position in a 
text, you have to have traversed to that location in the text.  While 
traversing, you can remember interesting things like the offsets of starts of 
lines, or the x/y positions of characters.

> Can you translate efficiently?

No, because there's no point :).  But you _could_ implement an overlay that 
cached things like the beginning of lines, or the x/y positions of interesting 
characters.
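
One way such an overlay could look is sketched below: a class (the name 
LineIndex and its methods are hypothetical) that records the byte offset of 
each line start once, then answers "which line is this byte offset on?" by 
binary search.  Scanning for the newline byte directly is safe in UTF-8 
because 0x0A never occurs inside a multibyte sequence.

```python
import bisect


class LineIndex:
    """Hypothetical overlay caching byte offsets of line starts."""

    def __init__(self, data: bytes):
        self.starts = [0]  # line 0 always starts at byte 0
        pos = data.find(b"\n")
        while pos != -1:
            self.starts.append(pos + 1)  # next line starts after the newline
            pos = data.find(b"\n", pos + 1)

    def line_of(self, offset: int) -> int:
        """Zero-based number of the line containing byte `offset`."""
        return bisect.bisect_right(self.starts, offset) - 1

    def start_of(self, line: int) -> int:
        """Byte offset at which zero-based `line` begins."""
        return self.starts[line]


data = "one\nzwei \u00fc\nthree".encode("utf-8")
index = LineIndex(data)
print(index.starts, index.line_of(11), index.start_of(2))
```

Building the index costs one pass over the text; lookups afterwards are 
logarithmic in the number of lines.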

> As I say elsewhere, it's possible that there really never is a need to 
> efficiently specify an absolute position in a large text as a character 
> (grapheme, whatever) count.

> But I think it would be hard to implement an efficient text-processing 
> *language*, eg, a Python module for *full conformance* in handling Unicode, 
> on top of UTF-8.

Still: why?  I guess if I have some free time I'll try my hand at it, and maybe 
I'll run into a wall and realize you're right :).

> Any time you have an algorithm that requires efficient access to arbitrary 
> text positions, you'll spend all your skull sweat fighting the 
> representation.  At least, that's been my experience with Emacsen.

What sort of algorithm would that be, though?  The main thing that I could 
think of is a text editor trying to efficiently allow the user to scroll to the 
middle of a large file without reading the whole thing into memory.  But, in 
that case, you could use byte positions to estimate, and display a heuristic 
number while calculating the real line numbers.  (This is what 'less' does, and 
it seems to work well.)
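
The estimation idea can be sketched as follows.  This is not a description of 
how 'less' is implemented; it only illustrates guessing a line number from a 
byte position, assuming lines elsewhere in the file are about as long as those 
in a sample that has already been read.  The function name estimate_line is 
hypothetical.

```python
def estimate_line(sample: bytes, target_offset: int) -> int:
    """Guess the zero-based line number at `target_offset`.

    Uses the average line length (in bytes) seen in `sample`, a chunk of
    the file that has already been read.
    """
    newlines = sample.count(b"\n")
    if newlines == 0:
        return 0  # no information about line length; fall back to line 0
    avg_line_bytes = len(sample) / newlines
    return int(target_offset / avg_line_bytes)


sample = b"abcd\n" * 10  # 50 bytes, 10 lines, 5 bytes per line
guess = estimate_line(sample, 5000)
print(guess)
```

The guess can be shown immediately and replaced once the real line count up to 
that offset has been computed in the background.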

>> So I don't really see what you're arguing for here. How do
>> *you* think positions in unicode strings should be represented?
> 
> I think what users should see is character positions, and they should
> be able to specify them numerically as well as via an opaque marker
> object.  I don't care whether that position is represented as bytes or
> characters internally, except that the experience of Emacsen is that
> representation as byte positions is both inefficient and fragile.  The
> representation as character positions is more robust but slightly more
> inefficient.

Is it really the representation as byte positions which is fragile (i.e. the 
internal implementation detail), or the exposure of that position to calling 
code, and the idiomatic usage of that number as an integer?

