Jim Jewett writes:

> > Apart from the surrogates, are there code points that aren't
> > characters?
> Yes. The BOM mark, for one.

Nitpick: The BOM *is* a character (U+FEFF, aka ZERO WIDTH NO-BREAK
SPACE). Its byte-swapped counterpart U+FFFE is guaranteed *not* to be
a character. (Martin wrote that correctly.) U+FFFF is also guaranteed
*not* to be a character; in fact, all code points U that are equal to
FFFE or FFFF modulo 0x10000 are guaranteed not to be characters (i.e.,
the last two code points in each plane).

> Plenty of other code points are reserved
> for private use, or not yet assigned,

Or reserved for use as surrogates, and therefore should never appear
in UTF-8 or UTF-32 streams -- but if they do, AIUI they must be passed
on uninterpreted unless the API explicitly says what it does with
them.

> or never will be assigned. There are also some that are explicitly
> not characters. (U+FD00..U+FDEF),

Guaranteed not to be assigned == not a character. The special range
of non-characters is quite a bit smaller, U+FDD0..U+FDEF.

> and some that might be debatable (unprinted control
> characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER)

Not a good idea to classify this way. Those *are* characters, and a
process may interpret them or not. Python (the language and the
stdlib, except where it explicitly says otherwise) definitely should
*not* worry about these things. They're characters, that's the most
Python needs to know.

> > Are there characters that don't have a representation as a
> > single code point? (I know some characters have multiple
> > representations, some of which use multiple code points.)

Not a question that can be answered without reference to a specific
application. An application may treat each code point as a character,
or it may choose to compose code points (e.g., into private space). The
most Python might want to do is deal with canonical equivalence, but
even then there are issues, such as the ö in the English word
coördinate.
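
For concreteness, here is a rough Python sketch of the code-point
classes mentioned so far and of the canonical-equivalence caveat. The
helper names is_noncharacter and is_surrogate are made up for this
illustration and are not stdlib functions; only unicodedata.normalize
is a real library call.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """Hypothetical helper: True for code points guaranteed not to be characters."""
    # U+FDD0..U+FDEF, plus every code point equal to FFFE or FFFF modulo 0x10000.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) >= 0xFFFE

def is_surrogate(cp: int) -> bool:
    """Hypothetical helper: True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF

# The BOM is a character; its byte-swapped counterpart is not.
print(is_noncharacter(0xFEFF))    # False
print(is_noncharacter(0xFFFE))    # True
print(is_noncharacter(0x10FFFF))  # True (last code point of plane 16)

# Canonical equivalence: two spellings of "coördinate".
composed = "co\u00f6rdinate"      # ö as the single code point U+00F6
decomposed = "coo\u0308rdinate"   # o followed by U+0308 COMBINING DIAERESIS
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(composed), len(decomposed))                        # 10 11
```

Normalization only tells you the two spellings are canonically
equivalent; whether the diaeresis counts as part of the character is
still up to the application.
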
I would consider the diaeresis as a separate diacritic (meaning
"don't pronounce as 'oo', pronounce as 'oh-oh'"), not a component of a
single character. There may be even clearer examples.

> There are also plenty of things that a native speaker may view as a
> single character, but which unicode treats as (at most) a Named
> Sequence.

E.g., the New Line Function (Unicode's name for "universal newline"),
which can be any of the usual suspects (CR, LF, CRLF) depending on
context.

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com