Glyph Lefkowitz writes:

> But I don't think that anyone is filling up main memory with
> gigantic piles of character indexes and needs to squeeze out that
> extra couple of bytes of memory on such a tiny object.
How do you think editors and browsers represent the regions that they highlight, then? How do you think structure-oriented editors represent the structures that they work with? In a detailed analysis of a C or Java file, it's easy to end up with a ratio of almost one position per two characters. Note that *buffer* characters are typically smaller than a platform word, so saving one word in the representation of a position means an increase of 100% or more in the number of characters that fit in a given amount of memory. Even in the case of UCS-4 on a 32-bit platform, that's a 50% increase in the maximum usable size of a buffer before a parser starts raising OOM errors.

There are two plausible ways to represent these structures that I can think of offhand. The first is to do it the way Emacs does: read the text into a buffer and use position offsets to map to display or structure attributes. The second is to use a hierarchical document model and render the display by traversing the document hierarchy. It's not obvious to me that forcing use of the second representation is a good idea for performance in an editor, and I would expect the two to have similar memory requirements.

> Plus, this would allow such a user to stop copying the character
> data itself just to decode it, and on mostly-ascii UTF-8 text (a
> common use-case) this is a 2x savings right off the bat.

Which only matters if you're a server in the business of shoveling octets really fast but are CPU-bound (seems unlikely to me, but I'm no expert; WDYT?), and even then it's only that big a savings if you can push the issue of validating the purported UTF-8 text off onto others. If you're not validating, you may as well acknowledge that you're processing binary data, not text.[1] But we're talking about text.

And of course, if you copy mostly-Han UTF-8 text (a common use-case) to UCS-2, that's a 1.5x memory savings right off the bat, and a 3x time savings when iterating on most architectures (one increment operation per character instead of three).

As I've already said, I don't think this is an argument in favor of either representation. Sometimes one wins, sometimes the other. I don't think supplying both is a great idea, although I've proposed it myself for XEmacs (but made as opaque as possible).

> > In Python it's true that markers can use the same data structure
> > as integers and simply provide different methods, and it's
> > arguable that Python's design is better. But if you use bytes
> > internally, then you have problems.
>
> No, you just have design questions.

Call them what you like, they're as yet unanswered. In any given editing scenario, I'd concede that it's a "SMOD". But if you're designing a language for text processing, it's a restriction that I believe to be a hindrance to applications. Many applications may prefer to use a straightforward array implementation of text and focus their design efforts on the real problems of their use cases.

> > Do you expose that byte value to the user? If so, what do you do
> > if the user specifies a byte value that points into a multibyte
> > character?
>
> Go to the beginning of the multibyte character. Report that
> position; if the user then asks the requested marker object for
> its position, it will report that byte offset, not the
> originally-requested one. (Obviously, do the same thing for
> surrogate pair code points.)

I will guarantee that some use cases will prefer that you go to the beginning of the *next* character.
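To make the snapping rule concrete, here is a minimal sketch for a UTF-8 buffer held as Python bytes (the helper name is mine, and this is an illustration, not anybody's proposed API):

    def snap_to_char_start(buf, pos):
        """Back up pos to the first byte of the UTF-8 character
        containing it (assumes 0 <= pos < len(buf)).  UTF-8
        continuation bytes match the pattern 0b10xxxxxx."""
        while pos > 0 and (buf[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    buf = "naïve".encode("utf-8")   # b'na\xc3\xafve'; the i-diaeresis
                                    # occupies byte offsets 2 and 3
    snap_to_char_start(buf, 3)      # -> 2
    snap_to_char_start(buf, 2 + 1)  # -> 2: "pos += 1" got us nowhere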
For an obvious example, your algorithm will infloop if the user iterates with "pos += 1", as the sketch above shows. (And the opposite problem appears for "beginning of next character" combined with "pos -= 1".) Of course this trivial example is easily addressed by saying "the user should be using the character iterator API here", but I expect the issue can arise where that is not an easy answer. Either the API becomes complex, or users and developers will have to do complex bookkeeping that should be done by the text implementation.

Nor is it obvious that surrogate pairs will be present in a UCS-2 representation. Specifically, they can be encoded as single characters in a private-use area in almost all applications, at a very small cost in performance.

> > What if the user wants to specify position by number of
> > characters?
>
> Part of the point that we are trying to make here is that nobody
> really cares about that use-case. In order to know anything useful
> about a position in a text, you have to have traversed to that
> location in the text.

Binary search of an ordered text is useful. Granted, this particular example can be addressed usefully in terms of byte positions (viz. your example of less; see the sketch below), but your basic premise is falsified.
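For concreteness, here is a sketch of that kind of byte-position binary search over sorted, newline-delimited UTF-8 text (my own illustration; it assumes the buffer ends with a newline, and it leans on the fact that bytewise order of UTF-8 agrees with code point order):

    def find_line(buf, key):
        """Return the byte offset of the first line >= key in a
        sorted, newline-terminated bytes buffer."""
        lo, hi = 0, len(buf)
        while lo < hi:
            mid = (lo + hi) // 2
            # Snap to the start of the line containing mid; this is
            # cheap with byte positions, expensive with character
            # positions.
            start = buf.rfind(b"\n", 0, mid) + 1
            end = buf.find(b"\n", start)
            if buf[start:end] < key:
                lo = end + 1
            else:
                hi = start
        return lo

    words = "apple\nbanana\ncherry\n".encode("utf-8")
    find_line(words, b"banana")    # -> 6, the offset of "banana"

Note that no character was counted anywhere; that is the whole trick, and it is why less can get away with byte positions.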
> You can remember interesting things like the offsets of starts of
> lines, or the x/y positions of characters.
>
> > Can you translate efficiently?
>
> No, because there's no point :). But you _could_ implement an
> overlay that cached things like the beginning of lines, or the x/y
> positions of interesting characters.

Emacs does, and a lot of effort has gone into it, and it still sucks compared to an array representation. Maybe _you_ _could_ do better, but as yet we haven't managed to pull it off. :-(

> > But I think it would be hard to implement an efficient
> > text-processing *language*, e.g., a Python module for *full
> > conformance* in handling Unicode, on top of UTF-8.
>
> Still: why? I guess if I have some free time I'll try my hand at
> it, and maybe I'll run into a wall and realize you're right :).

I'd rather have you make it plausible to me that there's no point in having efficient access to arbitrary character positions. Then maybe you can delegate that implementation to me. :-) But my Emacs experience says otherwise, and IIUC the intuition and/or experience of MAL and Guido says this is not a YAGNI.

> > Any time you have an algorithm that requires efficient access to
> > arbitrary text positions, you'll spend all your skull sweat
> > fighting the representation. At least, that's been my experience
> > with Emacsen.
>
> What sort of algorithm would that be, though? The main thing that
> I could think of is a text editor trying to efficiently allow the
> user to scroll to the middle of a large file without reading the
> whole thing into memory.

Reading into memory or not is a red herring, I think. For many legacy encodings you have to read pretty much the whole thing anyway, because they are stateful, and doing so is just not very expensive compared to the text processing itself (unless your application is shoveling octets as fast as possible, in which case character positions are indeed a YAGNI). The question is whether opaque markers are always sufficient.

For example, XEmacs does use byte positions internally for markers and extents (objects representing regions of text that can carry arbitrary properties, but are tuned for display properties). Obviously, we have the marker objects you propose, and indeed the representation is as efficient as you claim.

However, these positions are not exposed as integers to end users, Lisp, or even most of the C code. If a client (end user or code) requests a position, it gets a character position. Such requests are frequent enough that they constitute a major drag on many practical applications. It may be that this is unnecessary, as less shows for its application. But less is not an editor, let alone a language for writing editors. Do you know of an editor language of power comparable to Emacs Lisp that is not based on an array representation of text?

> Is it really the representation as byte positions which is fragile
> (i.e. the internal implementation detail), or the exposure of that
> position to calling code, and the idiomatic usage of that number
> as an integer?

It's the latter. Sufficient effort can make it safe to use byte positions, and the effort is not all that great as long as you don't demand efficiency. The XEmacs vs. Emacs implementations of Mule demonstrate that. We at XEmacs never did expose byte positions even to the C code (other than to buffer and string methods), and that implementation has not had to change much, if at all, in 15 years. The caching mechanism that makes character position access reasonably efficient, however, has been buggy, not so efficient, and so complex that RMS said "I was going to implement your [position cache] in Emacs but it was too hard for me to understand". (OTOH, the alternative Emacs had implemented turned out to be O(n**2) or worse, so he had to replace it. Translating byte positions to character positions seems to be a real loser; a toy sketch of such a cache appears at the end of this message.)

Emacs did expose byte positions for efficiency reasons, and has had at least four regressions of the "\201 bug". "\201" prefixes a Latin-1 character in Emacs's internal code, and code that manipulated byte positions would often end up duplicating that prefix byte, because all trailing bytes in Mule code are also Latin-1 code points. (Don't ask me about the exact mechanism; XEmacs's implementation is quite different and never suffered from this bug.)

Note that a \201-like bug is very unlikely to occur in Python's UCS-2 representation, because the semantics of surrogate values in Unicode are unambiguous. However, I believe similar bugs would be possible in a UTF-8 representation -- if code is allowed to choose whether to view UTF-8 in binary or text mode -- because trailing byte values are Latin-1 code points.

Maybe I'm just an old granny, scared of my shadow. <wink>

Footnotes:

[1] I have no objection to providing "text" algorithms (such as regexps) for use on "binary" data. But then they don't provide any guarantees that transformations of purported text remain text.
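P.S. Since I keep waving my hands about position caches, here is a toy sketch of the kind of byte-to-character translation cache under discussion. This is my simplification, not the XEmacs code, and it deliberately omits invalidating checkpoints when the buffer is edited -- which is exactly where the bugs and the complexity live:

    import bisect

    class PositionCache:
        """Toy byte->character cache for an immutable UTF-8 buffer.

        Records a (byte_pos, char_pos) checkpoint every CHUNK bytes,
        so a translation scans at most CHUNK bytes instead of the
        whole buffer."""

        CHUNK = 1024

        def __init__(self, buf):
            self.buf = buf
            self.checkpoints = [(0, 0)]  # ascending (byte_pos, char_pos)
            chars = 0
            for i, b in enumerate(buf):
                if (b & 0xC0) != 0x80:   # a byte that starts a character
                    chars += 1
                if (i + 1) % self.CHUNK == 0:
                    self.checkpoints.append((i + 1, chars))

        def char_of_byte(self, pos):
            """Character offset of byte offset pos; pos must lie on a
            character boundary."""
            # Rightmost checkpoint at or before pos, then scan forward
            # counting character-start bytes.
            i = bisect.bisect_right(self.checkpoints,
                                    (pos, float("inf"))) - 1
            byte_pos, chars = self.checkpoints[i]
            for b in self.buf[byte_pos:pos]:
                if (b & 0xC0) != 0x80:
                    chars += 1
            return chars

A real editor's cache additionally has to survive arbitrary insertions and deletions, and that bookkeeping is the part that has kept XEmacs's version buggy and complex.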