Re: character sets in HTML files?

Bill Janssen Thu, 18 Oct 2001 21:29:03 -0700

>       Remember, implementing an XML parser is no trivial matter. If the
> XML page or application fails validation, the page is bitbucketed. In the
> current scheme, Plucker tries to make sense of what's left of the broken
> HTML, but with XML, that's not allowed.


Luckily, Python 2 comes with three XML parsers.  I've been reading up on
them and trying to figure out which is the simplest to use for Plucker.
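
To illustrate the strictness described in the quoted text, here is a
minimal sketch of a well-formedness check. It uses xml.dom.minidom from
the standard library, picked only for illustration and not necessarily
the parser Plucker would end up with. The helper name is hypothetical,
and the code is written for a current Python rather than for 2.x.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if markup parses as XML, False if the parser rejects it."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # A strict XML parser gives up here instead of guessing at a repair.
        return False


print(is_well_formed("<p>closed</p>"))
print(is_well_formed("<p>unclosed<br>"))
```

The second call is the kind of sloppy HTML that a forgiving parser would
try to salvage, but that an XML parser has to refuse.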

> > Indeed, but I thought XML was in unicode?  Or did I dream that? Probably
> > did, as I'm sure I've seen encoding="iso-8859-1" in some files,
> > actually.
> 
>       It is indeed unicode, however, you can override it.

There are two things going on.

Every XML and/or HTML document allows the full Unicode character set.
Period.  Every HTML document can contain any Unicode character.  But
they are expressed differently in the document depending on what
charset encoding is being used.  If a simple encoding like US-ASCII is
used, characters not in that character set are expressed as "&#dddd;",
where dddd is the decimal value for the Unicode character code.
That's why you sometimes see things like "&#8212;" (em-dash) in HTML
files.
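
As a rough sketch of the point above, the snippet below writes a string
as US-ASCII and lets Python substitute decimal character references for
anything outside that charset. The helper name is hypothetical, and the
"xmlcharrefreplace" error handler assumes a reasonably current Python.

```python
def to_ascii_with_char_refs(text):
    """Encode text as US-ASCII, using &#dddd; for characters outside it."""
    # "xmlcharrefreplace" swaps each unencodable character for its
    # decimal numeric character reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "wait \u2014 what?"  # \u2014 is the em-dash, decimal 8212
print(to_ascii_with_char_refs(sample))  # wait &#8212; what?
```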

See http://www.w3.org/TR/2000/REC-xml-20001006#charsets for XML, and
http://www.w3.org/TR/html4/charset.html#h-5.1 for HTML, for more on
all this.

One of the practical consequences of all this is that when you receive
an HTML document in, for example, the UTF-16LE or ISO-8859-5 charset
encoding, you need to transform it to a charset encoding your parser
understands (say US-ASCII or ISO-8859-1) before you can even parse it.
One of the advantages of using Python 2 for parsing is that its Unicode
strings can represent the complete Unicode character set, rather than
just a locale-specific subset, and that it includes codecs for
converting many (most) charset encodings to and from Unicode, UTF-8
among them.
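
A minimal sketch of that transformation step, assuming the source
charset is already known (from an HTTP header or a META tag, say). The
function name is hypothetical, and the code is written for a current
Python; only the standard codecs are used.

```python
def recode_to_utf8(raw_bytes, source_charset):
    """Decode bytes from source_charset and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_charset)  # bytes -> Unicode string
    return text.encode("utf-8")              # Unicode string -> UTF-8 bytes


# The same Cyrillic word arriving in two different charset encodings.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
from_8859_5 = recode_to_utf8(word.encode("iso-8859-5"), "iso-8859-5")
from_utf16le = recode_to_utf8(word.encode("utf-16-le"), "utf-16-le")

# Both inputs end up as identical UTF-8 bytes.
print(from_8859_5 == from_utf16le)
```

Once everything is in one encoding, the parser only has to deal with a
single representation, whatever the document started out as.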

Bill
