Note that I'm not saying that there shouldn't be a UTF-8 string type; I'm just saying that for some purposes it might be a good idea to keep UTF-16 and UTF-32 string types around.
Glyph Lefkowitz writes:

 > > The theory is that accessing the first character of a region in a
 > > string often occurs as a primitive operation in O(N) or worse
 > > algorithms, sometimes without enough locality at the "collection of
 > > regions" level to give a reasonably small average access time.
 >
 > I'm not sure what you mean by "the theory is". Whose theory? About
 > what?

Mine. About why somebody somewhere someday would need fast random
access to character positions. "Nobody ever needs that" is a strong
claim.

 > > In practice, any *Emacs user can tell you that yes, we do need to
 > > be able to access the Nth codepoint in a buffer in constant time.
 > > The O(N) behavior of current Emacs implementations means that
 > > people often use a binary coding system on large files. Yes, some
 > > position caching is done, but if you have a large file (e.g., a
 > > mail file) which is virtually segmented using pointers to regions,
 > > locality gets lost. (This is not a design bug, this is a
 > > fundamental requirement: consider fast switching between threaded
 > > view and author-sorted view.)
 >
 > Sounds like a design bug to me. Personally, I'd implement "fast
 > switching between threaded view and author-sorted view" the same way
 > I'd address any other multiple-views-on-the-same-data problem. I'd
 > retain data structures for both, and update them as the underlying
 > model changed.

Um, that's precisely the design I'm talking about. But as you
recognize later, the message content is not part of those structures,
because there's no real point in copying it *if you have fast access
to character positions*. In a variable-width-character,
character-addressed design, there can be a perceptible delay in
accessing even the "next" message's content if you're in the wrong
view.
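The O(N) access being argued about is easy to see in a variable-width encoding like UTF-8. Here's a minimal sketch (the function name is mine, not from any implementation under discussion) of locating the Nth codepoint in a UTF-8 buffer, which has no choice but to scan every byte before it:

```python
def nth_codepoint_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th codepoint in UTF-8 `data`.

    UTF-8 continuation bytes look like 0b10xxxxxx, so we count only
    lead bytes.  The scan is O(offset) -- there is no constant-time
    way to jump to codepoint n in a variable-width buffer.
    """
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # a lead byte starts a codepoint
            if count == n:
                return offset
            count += 1
    raise IndexError("codepoint index out of range")

utf8 = "héllo".encode("utf-8")           # 'é' occupies two bytes
nth_codepoint_offset(utf8, 2)            # → 3: codepoint 2 starts at byte 3
```

A fixed-width representation (UCS-4, or UCS-2 within the BMP) turns the same lookup into plain array indexing, which is the trade-off at issue in this thread.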
 > These representations may need to maintain cursors into the
 > underlying character data, if they must retain giant wads of
 > character data as an underlying representation (arguably the _main_
 > design bug in Emacs, that it encourages you to do that for
 > everything, rather than imposing a sensible structure), but those
 > cursors don't need to be code-point counters; they could be byte
 > offsets, or opaque handles whose precise meaning varied with the
 > potentially variable underlying storage.

Both byte offsets and opaque handles really, really suck to design,
implement, and maintain if Lisp- or Python-level users can use them.
They're hard enough to get right when you can hide them behind
internal APIs, but if they're accessible to users they're an endless
source of user bugs. What was that you were saying about the
difficulty of remembering which argument is the fd? It's like that.
Sure, you can design APIs to help get that right, but it's not easy to
provide one that can be used for all the different applications out
there.

 > Also, please remember that Emacs couldn't be implemented with giant
 > Python strings anyway: crucially, all of this stuff is _mutable_ in
 > Emacs.

No, that's a red herring. The use cases where Emacs users complain
most are browsing giant logs and reading old mail; neither needs the
content to be mutable. (Of course mutability is a convenience in the
mail case if you delete messages or fetch new mail, but that could be
done with transaction logs that get appended to the on-disk file.)

 > Case in point: "occur" needs to scan the buffer anyway; you can't do
 > better than linear time there. So you're going to iterate through
 > the buffer, using one of the techniques that James proposed, and
 > remember some locations. Why not just have those locations be opaque
 > cursors into your data?

They are. But unless you're willing to implement correct character
motion, they need to be character indices, which are slow to convert
into actual buffer locations.
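The kind of user bug that exposed byte offsets invite can be shown in two lines, assuming UTF-8 storage (the sample data and offset here are made up for the example): slicing a buffer at an arbitrary byte offset can land in the middle of a multibyte sequence.

```python
text = "日本語"                      # three characters, 3 bytes each in UTF-8
data = text.encode("utf-8")          # 9 bytes total

# A user who treats byte offsets as character positions and takes a
# "substring of length 4" splits the second character in half:
try:
    data[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print("user bug:", exc)         # decoding fails mid-character
```

Nothing in the type system stops the bad slice; the error only surfaces later, at decode time, far from the code that computed the offset. That's what makes these handles so painful to hand to users.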
We've implemented caches, as does Emacs, but they don't always get
hits. Finding an arbitrary position once can involve perceptible delay
even on machines up to 1GHz; doing it in a loop (which mail programs
have a habit of doing) can be very painful.

 > In summary: you're right, in that James missed a spot. You need
 > bidirectional, *copyable* iterators that can traverse the string by
 > byte, codepoint, grapheme, or decomposed glyph.

That's a good start, yes. But once you talk about "remembering some
locations", you're implicitly talking about random access. Either you
maintain position indexes, which naively implemented can easily
approach the size of the text buffer (indexes are going to be at least
4 bytes, possibly 8, per position, and something like "occur" can
generate a lot of positions) -- in which case you might as well just
use a representation that is an array in the first place -- or you
need to implement a position cache, which can be very hairy to do
well. Or you can give user programs memory indices, and enjoy the fun
as the poor developers do things like "pos += 1", which works fine on
the ASCII data they have lying around, then wonder why they get
Unicode errors when they take substrings.

I'm sure it all can be done, but I don't think it will be done right
the first time around. You may be right that designs better adapted
to large data sets than Emacs's "everything is a big buffer" will
almost always be available with reasonable effort. But remember, a
lot of good applications start small, when a flat array might make
lots of sense as the underlying structure, and then need to scale. If
you need to scale for the paying customers, well, "ouch!" -- but you
can afford it; for many volunteer or startup projects that takes the
wind right out of your sails.
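The "pos += 1" failure mode can be sketched as follows, assuming a UTF-8 buffer; `next_char` is a hypothetical helper showing what correct character motion has to do instead of bare byte arithmetic:

```python
def next_char(data: bytes, pos: int) -> int:
    """Advance `pos` to the start of the next UTF-8 character --
    the motion that a bare 'pos += 1' silently gets wrong."""
    pos += 1
    # Skip continuation bytes (0b10xxxxxx) until the next lead byte.
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

ascii_data = b"abc"
utf8_data = "aé".encode("utf-8")     # b'a\xc3\xa9'

next_char(ascii_data, 0)             # → 1, same as pos += 1 on ASCII
next_char(utf8_data, 1)              # → 3; pos += 1 would give 2, mid-character
```

On the ASCII data the developer has lying around, `pos += 1` and `next_char` agree, so the bug ships; it only bites when multibyte text shows up.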
Note that if the user doesn't use private space, in a UCS-2 build you
have about 1.5K code points available for compressing non-BMP
characters into a 2-byte, valid Unicode representation. (Of course
you need to save off the table somewhere if that data ever gets out of
your program, but that's easy.) I find it hard to imagine that there
will be many use cases that need more than that many non-BMP
characters. So you can probably tell those few users who care to use
a UCS-4 build; most of the array use cases can be served by UCS-2.

Note that in my Japanese corpora, UTF-8 averages just about 2 bytes
per character anyway -- and those are mail files, where two lines of
Japanese may be preceded by 2KB of ASCII-only header. I suspect
Hebrew, Arabic, and Cyrillic users will have similar experiences.

By the way, to send the ball back into your court: I have a feeling
that the demand for UTF-8 is once again driven by native English
speakers, who are very shortly going to find themselves, and the data
they are most familiar with, very much in the minority. Of course the
market that benefits from UTF-8 compression will remain very large for
the immediate future, but in the grand scheme of things, most of the
world is going to prefer UTF-16 by a substantial margin.

N.B. I'm not talking about persistent storage, where it's six of one
and half a dozen of the other; you can translate UTF-8 to UTF-16 far
faster than you can read content from disk, of course.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com