Glyph Lefkowitz writes:

> But I don't think that anyone is filling up main memory with
> gigantic piles of character indexes and needs to squeeze out that
> extra couple of bytes of memory on such a tiny object.
How do you think editors and browsers represent the regions that they highlight, then? How do you think structure-oriented editors represent the structures that they work with? In a detailed analysis of a C or Java file, it's easy to end up with a ratio of almost one position per two characters. Note that *buffer* characters are typically smaller than a platform word, so saving one word in the representation of a position means an increase of 100% or more in the number of characters that fit in a given amount of memory. Even in the case of UCS-4 on a 32-bit platform, that's a 50% increase in the maximum usable size of a buffer before a parser starts raising OOM errors.

There are two plausible ways to represent these structures that I can think of offhand. The first is to do it the way Emacs does: read the text into a buffer and use position offsets to map to display or structure attributes. The second is to use a hierarchical document model and render the display by traversing the document hierarchy. It's not obvious to me that forcing use of the second representation is a good idea for performance in an editor, and I would expect the two to have similar memory requirements.

> Plus, this would allow such a user to stop copying the character
> data itself just to decode it, and on mostly-ascii UTF-8 text (a
> common use-case) this is a 2x savings right off the bat.

Which only matters if you're a server in the business of shoveling octets really fast but are CPU-bound (seems unlikely to me, but I'm no expert; WDYT?), and even then it's only that big a savings if you can push the issue of validating the purported UTF-8 text off onto others. If you're not validating, you may as well acknowledge that you're processing binary data, not text.[1] But we're talking about text.

And of course, if you copy mostly-Han UTF-8 text (a common use-case) to UCS-2, that's a 1.5x memory savings right off the bat, and a 3x time savings when iterating on most architectures (one increment operation per character instead of three).

As I've already said, I don't think this is an argument in favor of either representation. Sometimes one wins, sometimes the other. I don't think supplying both is a great idea, although I've proposed it myself for XEmacs (but made as opaque as possible).

> > In Python it's true that markers can use the same data structure
> > as integers and simply provide different methods, and it's
> > arguable that Python's design is better. But if you use bytes
> > internally, then you have problems.
>
> No, you just have design questions.

Call them what you like, they're as yet unanswered. In any given editing scenario, I'd concede that it's a "SMOD". But if you're designing a language for text processing, it's a restriction that I believe to be a hindrance to applications. Many applications may prefer to use a straightforward array implementation of text and focus their design efforts on the real problems of their use cases.

> > Do you expose that byte value to the user? If so, what do you do
> > if the user specifies a byte value that points into a multibyte
> > character?
>
> Go to the beginning of the multibyte character. Report that
> position; if the user then asks the requested marker object for
> its position, it will report that byte offset, not the
> originally-requested one. (Obviously, do the same thing for
> surrogate pair code points.)

I will guarantee that some use cases will prefer that you go to the beginning of the *next* character.
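To make the snapping rule concrete, here is a minimal sketch for a UTF-8 buffer held as Python bytes (the helper name is mine, and this is an illustration, not anybody's proposed API):

    def snap_to_char_start(buf, pos):
        """Back up pos to the first byte of the UTF-8 character
        containing it (assumes 0 <= pos < len(buf)).  UTF-8
        continuation bytes match the pattern 0b10xxxxxx."""
        while pos > 0 and (buf[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    buf = "naïve".encode("utf-8")   # b'na\xc3\xafve'; the i-diaeresis
                                    # occupies byte offsets 2 and 3
    snap_to_char_start(buf, 3)      # -> 2
    snap_to_char_start(buf, 2 + 1)  # -> 2: "pos += 1" got us nowhere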
For an obvious example, your algorithm will infloop if the user iterates with "pos += 1", as the sketch above shows. (And the opposite problem appears for "beginning of next character" combined with "pos -= 1".) Of course this trivial example is easily addressed by saying "the user should be using the character iterator API here", but I expect the issue can arise where that is not an easy answer. Either the API becomes complex, or users and developers will have to do complex bookkeeping that should be done by the text implementation.

Nor is it obvious that surrogate pairs will be present in a UCS-2 representation. Specifically, they can be encoded as single characters in a private-use area in almost all applications, at a very small cost in performance.

> > What if the user wants to specify position by number of
> > characters?
>
> Part of the point that we are trying to make here is that nobody
> really cares about that use-case. In order to know anything useful
> about a position in a text, you have to have traversed to that
> location in the text.

Binary search of an ordered text is useful. Granted, this particular example can be addressed usefully in terms of byte positions (viz. your example of less; see the sketch below), but your basic premise is falsified.
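For concreteness, here is a sketch of that kind of byte-position binary search over sorted, newline-delimited UTF-8 text (my own illustration; it assumes the buffer ends with a newline, and it leans on the fact that bytewise order of UTF-8 agrees with code point order):

    def find_line(buf, key):
        """Return the byte offset of the first line >= key in a
        sorted, newline-terminated bytes buffer."""
        lo, hi = 0, len(buf)
        while lo < hi:
            mid = (lo + hi) // 2
            # Snap to the start of the line containing mid; this is
            # cheap with byte positions, expensive with character
            # positions.
            start = buf.rfind(b"\n", 0, mid) + 1
            end = buf.find(b"\n", start)
            if buf[start:end] < key:
                lo = end + 1
            else:
                hi = start
        return lo

    words = "apple\nbanana\ncherry\n".encode("utf-8")
    find_line(words, b"banana")    # -> 6, the offset of "banana"

Note that no character was counted anywhere; that is the whole trick, and it is why less can get away with byte positions.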
> You can remember interesting things like the offsets of starts of
> lines, or the x/y positions of characters.
>
> > Can you translate efficiently?
>
> No, because there's no point :). But you _could_ implement an
> overlay that cached things like the beginning of lines, or the x/y
> positions of interesting characters.

Emacs does, and a lot of effort has gone into it, and it still sucks compared to an array representation. Maybe _you_ _could_ do better, but as yet we haven't managed to pull it off. :-(

> > But I think it would be hard to implement an efficient
> > text-processing *language*, e.g., a Python module for *full
> > conformance* in handling Unicode, on top of UTF-8.
>
> Still: why? I guess if I have some free time I'll try my hand at
> it, and maybe I'll run into a wall and realize you're right :).

I'd rather have you make it plausible to me that there's no point in having efficient access to arbitrary character positions. Then maybe you can delegate that implementation to me. :-) But my Emacs experience says otherwise, and IIUC the intuition and/or experience of MAL and Guido says this is not a YAGNI.

> > Any time you have an algorithm that requires efficient access to
> > arbitrary text positions, you'll spend all your skull sweat
> > fighting the representation. At least, that's been my experience
> > with Emacsen.
>
> What sort of algorithm would that be, though? The main thing that
> I could think of is a text editor trying to efficiently allow the
> user to scroll to the middle of a large file without reading the
> whole thing into memory.

Reading into memory or not is a red herring, I think. For many legacy encodings you have to read pretty much the whole thing anyway, because they are stateful, and doing so is just not very expensive compared to the text processing itself (unless your application is shoveling octets as fast as possible, in which case character positions are indeed a YAGNI). The question is whether opaque markers are always sufficient.

For example, XEmacs does use byte positions internally for markers and extents (objects representing regions of text that can carry arbitrary properties, but are tuned for display properties). Obviously, we have the marker objects you propose, and indeed the representation is as efficient as you claim.

However, these positions are not exposed as integers to end users, Lisp, or even most of the C code. If a client (end user or code) requests a position, it gets a character position. Such requests are frequent enough that they constitute a major drag on many practical applications. It may be that this is unnecessary, as less shows for its application. But less is not an editor, let alone a language for writing editors. Do you know of an editor language of power comparable to Emacs Lisp that is not based on an array representation of text?

> Is it really the representation as byte positions which is fragile
> (i.e. the internal implementation detail), or the exposure of that
> position to calling code, and the idiomatic usage of that number
> as an integer?

It's the latter. Sufficient effort can make it safe to use byte positions, and the effort is not all that great as long as you don't demand efficiency. The XEmacs vs. Emacs implementations of Mule demonstrate that. We at XEmacs never did expose byte positions even to the C code (other than to buffer and string methods), and that implementation has not had to change much, if at all, in 15 years. The caching mechanism that makes character position access reasonably efficient, however, has been buggy, not so efficient, and so complex that RMS said "I was going to implement your [position cache] in Emacs but it was too hard for me to understand". (OTOH, the alternative Emacs had implemented turned out to be O(n**2) or worse, so he had to replace it. Translating byte positions to character positions seems to be a real loser; a toy sketch of such a cache appears at the end of this message.)

Emacs did expose byte positions for efficiency reasons, and has had at least four regressions of the "\201 bug". "\201" prefixes a Latin-1 character in Emacs's internal code, and code that manipulated byte positions would often end up duplicating that prefix byte, because all trailing bytes in Mule code are also Latin-1 code points. (Don't ask me about the exact mechanism; XEmacs's implementation is quite different and never suffered from this bug.)

Note that a \201-like bug is very unlikely to occur in Python's UCS-2 representation, because the semantics of surrogate values in Unicode are unambiguous. However, I believe similar bugs would be possible in a UTF-8 representation -- if code is allowed to choose whether to view UTF-8 in binary or text mode -- because trailing byte values are Latin-1 code points.

Maybe I'm just an old granny, scared of my shadow. <wink>

Footnotes:

[1] I have no objection to providing "text" algorithms (such as regexps) for use on "binary" data. But then they don't provide any guarantees that transformations of purported text remain text.
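P.S. Since I keep waving my hands about position caches, here is a toy sketch of the kind of byte-to-character translation cache under discussion. This is my simplification, not the XEmacs code, and it deliberately omits invalidating checkpoints when the buffer is edited -- which is exactly where the bugs and the complexity live:

    import bisect

    class PositionCache:
        """Toy byte->character cache for an immutable UTF-8 buffer.

        Records a (byte_pos, char_pos) checkpoint every CHUNK bytes,
        so a translation scans at most CHUNK bytes instead of the
        whole buffer."""

        CHUNK = 1024

        def __init__(self, buf):
            self.buf = buf
            self.checkpoints = [(0, 0)]  # ascending (byte_pos, char_pos)
            chars = 0
            for i, b in enumerate(buf):
                if (b & 0xC0) != 0x80:   # a byte that starts a character
                    chars += 1
                if (i + 1) % self.CHUNK == 0:
                    self.checkpoints.append((i + 1, chars))

        def char_of_byte(self, pos):
            """Character offset of byte offset pos; pos must lie on a
            character boundary."""
            # Rightmost checkpoint at or before pos, then scan forward
            # counting character-start bytes.
            i = bisect.bisect_right(self.checkpoints,
                                    (pos, float("inf"))) - 1
            byte_pos, chars = self.checkpoints[i]
            for b in self.buf[byte_pos:pos]:
                if (b & 0xC0) != 0x80:
                    chars += 1
            return chars

A real editor's cache additionally has to survive arbitrary insertions and deletions, and that bookkeeping is the part that has kept XEmacs's version buggy and complex.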