Re: character sets in HTML files?

Bill Janssen Fri, 19 Oct 2001 09:57:28 -0700

> > One of the advantages of using Python 2 for parsing is that it can work
> > with a complete 32-bit Unicode charset encoding (UTF-8), rather than
> > just a locale-specific subset, and includes support for transforming
> > many (most) subsets into UTF-8.
> 
>       My understanding is that you need the catalogs and NLS support built
> into Python to take advantage of that, and that means ensuring that the
> package maintainer (or if you do source builds on your own) did not use the
> --disable-nls switch when compiling. Many do (and there's good reason to).


David, I've looked through the Python 2.0 and 2.1 sources for a
--disable-nls switch, and can't find it.  It's not mentioned in the
README or any of the docs, and isn't in configure.in.  Looking at the
build sources, the unicode object isn't compiled conditionally in any
way, so it would be hard to build Python without it.

There is an issue about which codecs (transformers between encodings)
are installed.  By default only the codecs for the following encodings
are installed (cp* are various IBM, DOS, and Windows code pages):

ascii.py                           
cp037.py                           cp1006.py
cp1026.py                          cp1250.py
cp1251.py                          cp1252.py
cp1253.py                          cp1254.py
cp1255.py                          cp1256.py
cp1257.py                          cp1258.py
cp424.py                           cp437.py
cp500.py                           cp737.py
cp775.py                           cp850.py
cp852.py                           cp855.py
cp856.py                           cp857.py
cp860.py                           cp861.py
cp862.py                           cp863.py
cp864.py                           cp865.py
cp866.py                           cp869.py
cp874.py                           cp875.py
iso8859_1.py                       iso8859_10.py
iso8859_13.py                      iso8859_14.py
iso8859_15.py                      iso8859_2.py
iso8859_3.py                       iso8859_4.py
iso8859_5.py                       iso8859_6.py
iso8859_7.py                       iso8859_8.py
iso8859_9.py                       koi8_r.py
latin_1.py                         mac_cyrillic.py
mac_greek.py                       mac_iceland.py
mac_latin2.py                      mac_roman.py
mac_turkish.py                     mbcs.py
utf_16.py                          utf_16_be.py
utf_16_le.py                       utf_8.py

The codecs for CJK are, I think, still under development, and in any
case are distributed separately, at
http://sourceforge.net/projects/python-codecs.
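
If you want to check what a particular install can actually handle,
something along these lines should do it.  This is only a rough sketch:
codecs.lookup() is the standard registry call, but the helper names
(codec_available, to_utf8) and the sample encoding names are my own
choices for illustration, and "euc_jp" just stands in for a CJK codec
that may or may not be present on your system.

```python
import codecs

def codec_available(name):
    """Return true if this interpreter can find a codec called `name`."""
    try:
        codecs.lookup(name)
        return 1
    except LookupError:
        return 0

def to_utf8(raw, source_encoding):
    """Decode `raw` from source_encoding, then re-encode it as UTF-8."""
    # codecs.lookup() returns (encoder, decoder, reader, writer);
    # the decoder returns a (unicode_text, bytes_consumed) pair.
    decoder = codecs.lookup(source_encoding)[1]
    text = decoder(raw)[0]
    return text.encode("utf-8")

# Illustrative names only; "euc_jp" is a guess at a CJK codec name.
for name in ("iso8859_1", "cp1252", "koi8_r", "utf_8", "euc_jp"):
    print("%-10s %s" % (name, codec_available(name)))
```

A page declared as, say, iso-8859-1 could then be normalized with
to_utf8(page_data, "iso8859_1"), assuming the declared charset is right
and maps onto one of the installed codec names.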

Bill
