> S 3120 Using UTF-8 as the default source encoding von Löwis
>
> The basic idea seems very reasonable. I expect that the changes to the
> parser may be quite significant though. Also, the parser ought to be
> weaned of C stdio in favor of Python's own I/O library. I wonder if
> it's really possible to let the parser read the raw bytes though --
> this would seem to rule out supporting encodings like UTF-16. Somehow
> I wonder if it wouldn't be easier if the parser operated on Unicode
> input? That way parsing unicode strings (which we must support as all
> strings will become unicode) will be simpler.

Actually, changes should be fairly minimal. The parser already transforms all input (no matter what source encoding) to UTF-8 before doing the parsing; this has worked well (as all keywords continue to be made of one-byte characters). The parser also already special-cases UTF-8 as the input encoding, by not putting it through a codec. That can also stay, except that it should now check that any non-ASCII bytes are well-formed UTF-8.

Untangling the parser from stdio - sure. I also think it would be desirable to read the whole source into a buffer, rather than reading the input line by line. That might be a bigger change, making the tokenizer a multi-stage algorithm:

1. read input into a buffer
2. determine source encoding (looking at a BOM, else a declaration within the first two lines, else default to UTF-8)
3. if the source encoding is not UTF-8, pass it through a codec (decode to string, encode to UTF-8). Otherwise, check that all bytes are really well-formed UTF-8.
4. start parsing

As for UTF-16: the lexer currently does not support UTF-16 as a source encoding, as we require an ASCII superset. I'm not sure whether UTF-16 needs to be supported as a source encoding, but with the above changes, it would be fairly easy to support, assuming we detect UTF-16 from the BOM (can't use the encoding declaration, because that works only for ASCII supersets).

Regards,
Martin
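
For illustration, the first three stages described above could look roughly like this in Python. This is only a sketch under assumptions, not the tokenizer's actual C code: the function names `detect_source_encoding` and `source_to_utf8` are hypothetical, the declaration pattern is borrowed from the PEP 263 style, and UTF-32 BOMs are not handled.

```python
import codecs
import re
from typing import Tuple

# Encoding declaration in the PEP 263 style (an assumption of this sketch).
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_source_encoding(buf: bytes) -> Tuple[str, int]:
    """Stage 2: return (encoding name, number of leading BOM bytes to skip)."""
    if buf.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if buf.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM itself, so nothing is skipped.
        return "utf-16", 0
    # No BOM: look for a declaration in the first two lines only.
    for line in buf.split(b"\n", 2)[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii"), 0
    return "utf-8", 0


def source_to_utf8(buf: bytes) -> bytes:
    """Stage 3: hand the parser UTF-8 bytes, whatever the source encoding."""
    encoding, skip = detect_source_encoding(buf)
    body = buf[skip:]
    if codecs.lookup(encoding).name == "utf-8":
        # No transcoding; decode only to check well-formedness.
        # Raises UnicodeDecodeError on malformed input.
        body.decode("utf-8")
        return body
    # Any other encoding: decode to a string, then encode to UTF-8.
    return body.decode(encoding).encode("utf-8")
```

Stage 1 is then just reading the file in binary mode (`open(path, "rb").read()`), and stage 4 would receive the result of `source_to_utf8`. Because the BOM is checked before the declaration, UTF-16 input is recognised even though its declaration line is not readable as ASCII.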