Encoding problems

Stefano Mazzocchi Tue, 11 Mar 2003 03:46:57 -0800

The new webapp welcome page contains a copyright character that is not written as the HTML entity &copy; or as a numeric &#xxx; reference; instead it is copied in directly as a raw character in the file's own encoding.

The offending char is contained in the welcome.xslt stylesheet that is encoded as ISO-8859-1.

The pipeline does

 - welcome.xml -> ISO-8859-1
 - welcome.xslt -> ISO-8859-1
 - xhtml serializer -> UTF-8

The results are indeed encoded using UTF-8, so the copyright sign ends up taking 16 bits (two bytes) instead of 8. (UTF-8 is a variable-length encoding: ASCII characters take a single byte and everything else takes two or more, which keeps it backward compatible with ASCII and compact for mostly-ASCII text. UTF-16 is more even in that respect, but it is rarely used on the web because mostly-ASCII text is normally half as big in UTF-8.)
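
To make the byte counts concrete, here is a minimal Python sketch of what happens to the copyright sign (U+00A9) on its way through such a pipeline. The variable names are only illustrative, and the last step assumes a user-agent that falls back to ISO-8859-1, which is a guess at the failure mode rather than something measured here.

```python
# The copyright sign, U+00A9, as it sits in the ISO-8859-1 stylesheet.
latin1_bytes = "\u00a9".encode("iso-8859-1")

# What the pipeline effectively does: decode the source, serialize as UTF-8.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(latin1_bytes)  # b'\xa9'      (one byte)
print(utf8_bytes)    # b'\xc2\xa9'  (two bytes)

# Assumption: a user-agent that ignores the declared encoding and reads the
# UTF-8 bytes as ISO-8859-1 turns each byte into its own character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # Â©
```

The stray character in front of the copyright sign is the typical symptom of UTF-8 output being read as ISO-8859-1.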

On MacOSX, the results are interesting:

 - mozilla 1.3b (20030212) displays the correct encoding
 - safari 1.0b(v60) doesn't
 - camino 0.7 (2003030613) displays the correct encoding
 - IE 5.2.2 (5010.1) doesn't

I traced the problem down to the fact that, apparently, both IE and Safari are *NOT* able to understand the encoding from the starting XML PI.
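
For reference, reading the encoding from the starting XML PI is not much work. The following is only a rough Python sketch: the function name and the regular expression are my own, and it assumes the document is in an ASCII-compatible encoding so that the declaration itself can be matched as bytes.

```python
import re

# Matches a leading <?xml ... encoding="..."?> declaration in raw bytes.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def encoding_from_xml_declaration(document: bytes):
    """Return the declared encoding name, or None if none is declared."""
    match = XML_DECL.match(document)
    return match.group(1).decode("ascii") if match else None


page = b'<?xml version="1.0" encoding="UTF-8"?><html/>'
print(encoding_from_xml_declaration(page))  # UTF-8
```

A user-agent that skips this step has only the HTTP headers (or its own default) to go on.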

On the other hand, placing

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

in the page head gives the user-agent the equivalent of an HTTP Content-Type header that states the encoding. This solved the encoding problem on *all* browsers.
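
The value carried by that meta element has the same shape as a real Content-Type header, so the charset can be pulled out of either the same way. A small Python sketch using the standard library's email.message.Message follows; the helper name charset_from_content_type is hypothetical, and this is only an illustration of the parsing, not a claim about what any browser does internally.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lower-cased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    # Returns None when the value carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```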

Results:

1) this is *NOT* a cocoon issue
2) be aware of the fact that some user-agents do not parse the XML PI to get the encoding, but only the HTTP headers.

NOTES:

1) there is no clear indication in the XHTML specification of how user-agents have to guess the encoding
2) there is no indication of what MIME type the XHTML content should have.

These problems reflect the lack of direct collaboration between the IETF and the W3C on the XML/HTTP relationship. Unfortunately, this is only going to get worse. So be prepared, especially for heavily internationalized content.

Stefano.
