On 9/20/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Adam Olsen" <[EMAIL PROTECTED]> wrote:
> >
> > On 9/20/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> > >
> > > "Adam Olsen" <[EMAIL PROTECTED]> wrote:
[snip token stuff]

Withdrawn.  Blake Winston pointed me to some problems in private as well.

> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end.  While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
>
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass.

I'm going to agree with you.  That's also why I'm going to assume Guido
meant to use Code Points, not Code Units (which would be bytes in the
case of UTF-8).

> > Using only utf-8 would be simpler than three distinct representations.
> > And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
>
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.

It bothers me that adding a single character could cause a string to
double or quadruple in size.  It may be the best compromise, though; I've
put a rough sketch of what I understand the scheme to be further down.

> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> >
> > Can you explain your reasoning?
> We can already split based on words, lines, etc., using split(), and
> re.split().  Building additional functionality for text.word[4] seems to
> be a waste of time.

I'm not entirely convinced, but I'll leave it for now.  Maybe it'll be a
3.1 feature.

> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint.  That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
>
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.

Is it really worthwhile if there are three different formats they'd have
to handle?

> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
>
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
>
> Regardless of our choice, *some platform* is going to be angry.  Why?
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take).  Windows takes utf-16.  Whatever underlying representation,
> *someone* is going to have to recode when dealing with GUI or OS-level
> operations.

Indeed, it seems like all our options are lose-lose.
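
To make sure we're arguing about the same thing, here's the
three-representation scheme as I understand it, sketched in throwaway
Python rather than the real C layout (choose_width, FixedWidthText and
the rest are names I've made up purely for illustration):

    # Sketch of the Latin-1/UCS-2/UCS-4 idea: pick the narrowest fixed
    # width that holds every code point in the string, so s[i] is a
    # constant-time array lookup rather than a scan.

    def choose_width(code_points):
        """Return 1, 2 or 4: bytes per code point for this string."""
        m = max(code_points, default=0)
        if m <= 0xFF:
            return 1        # latin-1 range
        if m <= 0xFFFF:
            return 2        # UCS-2 range
        return 4            # anything up to 0x10FFFF

    class FixedWidthText:
        def __init__(self, code_points):
            cps = list(code_points)
            self.width = choose_width(cps)
            self.buf = b"".join(cp.to_bytes(self.width, "little")
                                for cp in cps)

        def __len__(self):
            return len(self.buf) // self.width

        def __getitem__(self, i):
            # Character i always lives at byte offset i * width.
            start = i * self.width
            return int.from_bytes(self.buf[start:start + self.width],
                                  "little")

Note that appending one code point above 0xFF (or 0xFFFF) forces the
whole buffer to be rebuilt at twice or four times the width, which is
the size jump I was complaining about above.
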
Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Contiguous in memory

The best idea I've had so far for giving UTF-8 constant-time slicing is
to use a two-level table, with the second level having one byte per code
point (a sketch of what I mean is at the end of this mail).  However,
that brings the minimum size up to (more than) 2 bytes per code point,
ruining any space advantage that UTF-8 had.  UTF-16 is in the same boat,
but at (more than) 3 bytes per code point.

I think the only viable options (without changing the requirements) are
straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size variability
of three-way doesn't seem so important when its only competitor is
straight UCS-4.

The deciding factor is what we want to expose to third-party interfaces.
Sane interface (not bytes/code units), good efficiency, C-accessible:
pick two.

-- 
Adam Olsen, aka Rhamphoryncus
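
P.S. Here's roughly what I mean by the two-level table, again as
throwaway Python rather than a real C layout; the 64-code-point block
size and all the names are made up for illustration.  The second level
alone costs a byte per code point on top of the UTF-8 data itself, which
is where the "more than 2 bytes per code point" figure comes from.

    # Two-level index over a UTF-8 buffer.  First level: byte offset of
    # the start of each 64-code-point block.  Second level: one byte per
    # code point, giving its offset within its block (64 * 4 < 256, so
    # it always fits in a byte).

    BLOCK = 64

    class Utf8Index:
        def __init__(self, text):
            self.data = text.encode("utf-8")
            self.block_starts = []           # first level
            self.within_block = bytearray()  # second level
            pos = 0
            for i, ch in enumerate(text):
                if i % BLOCK == 0:
                    self.block_starts.append(pos)
                self.within_block.append(pos - self.block_starts[-1])
                pos += len(ch.encode("utf-8"))

        def byte_offset(self, i):
            # Constant time: no scanning through the UTF-8 data.
            return self.block_starts[i // BLOCK] + self.within_block[i]

        def __getitem__(self, i):
            start = self.byte_offset(i)
            end = start + 1
            # Continuation bytes look like 0b10xxxxxx; skip to the next
            # lead byte to find where this code point ends.
            while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
                end += 1
            return self.data[start:end].decode("utf-8")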
