A document has been updated: http://cocoon.zones.apache.org/daisy/documentation/1366.html
Document ID: 1366
Branch: main
Language: default
Name: How to configure consistent encoding in Cocoon (previously How to configure UTF-8 encoding for I18N everywhere)
Document Type: Cocoon Document (unchanged)
Updated on: 5/11/07 10:12:36 AM
Updated by: Alexander Klimetschek

A new version has been created, state: draft

Parts
=====

Content
-------
This part has been updated.
Mime type: text/xml (unchanged)
File name: (unchanged)
Size: 17221 bytes (previous version: 21105 bytes)
Content diff:
<html>
<body>
--- <h2 id="head-7be1dfafacbc6fb8e02d38cb177abb4a2030defc">How to configure UTF-8
--- encoding for I18N everywhere</h2>
---
<p>The best for internationalization is to handle everything in UTF-8, since this is probably the most intelligent encoding available out there. Everything means server side (Backend, XML), HTTP Requests/Responses and client side with
--- forms and dojo.io.bind.</p>
+++ forms and dojo.io.bind. If you need another encoding, simply replace all
+++ occurrences of UTF-8 with that one, but note that this guide was only tested
+++ with UTF-8, other encodings might not be supported at all places.</p>
<h4 id="head-b0e1772fd963c0cc72ccf58d5cada0c5797046c0">1. Sending all pages in UTF-8</h4>
(28 equal lines skipped)
CForms/Dojo</h4>
<p>If you use CForms with ajax enabled, Cocoon will make use of dojo.io.bind()
--- under the hood, which creates
--- XML<a href="http://wiki.apache.org/cocoon/HttpRequests">HttpRequests</a> that
--- POST the form data to the server. Here Dojo decides the encoding by default,
--- which does not match the browser's behaviour of using the charset defined in the
--- META tag. But you can easily tell Dojo which formatting to use for all
--- dojo.io.bind() calls, just include that in the top of your HTML pages, before
--- dojo.js is included:</p>
+++ under the hood, which creates XMLHttpRequests that POST the form data to the
+++ server.
+++ Here Dojo decides the encoding by default, which does not match the
+++ browser's behaviour of using the charset defined in the META tag. But you can
+++ easily tell Dojo which formatting to use for all dojo.io.bind() calls, just
+++ include that in the top of your HTML pages, before dojo.js is included:</p>
<pre><script>djConfig = { bindEncoding: "utf-8" };</script>
</pre>
(114 equal lines skipped)
<a href="http://wiki.apache.org/cocoon/UseCocoonXMLSerializerCode">UseCocoonXMLSerializerCode</a>
</p>
--- <h3 id="head-51c043008b794ccad3f9e792e0b028ec79d95993">Older documentation</h3>
+++ <h2>Further information</h2>
--- <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Basics</h4>
+++ <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Browser encoding basics
+++ </h4>
+++ <h5>Getting pages</h5>
+++
<p>If your Cocoon application needs to read request parameters that could contain <em>special</em> characters, i.e. characters outside of the first 128 ASCII characters, you'll need to pay attention to what encoding is used.</p>
(4 equal lines skipped)
can change the encoding, but it's quite safe to assume he/she won't do that (have you ever done it?).</p>
--- <p><em>In my browser this is the case, it is set in the preferences to
--- ISO-8859-1 and he encodes form parameters with that, regardless of the UTF-8
--- content type of the page containing the form. I can't remember when I did set
--- this property... So what to do with this case?
--- This means, it could be any
--- encoding.</em> --
--- <a href="http://wiki.apache.org/cocoon/AlexanderKlimetschek">AlexanderKlimetschek</a>
--- </p>
+++ <p>The browser will read the encoding either from the <meta> tag
+++ inside the HTML <head>:</p>
--- <p>After doing some tests with popular browsers, I've noticed that usually
--- browsers will not let the server know what encoding they used to encode the
--- parameters, so we need to make sure ourselves that the encoding used when
--- serializing pages corresponds to the encoding used when decoding request
--- parameters.</p>
---
--- <p>First of all, check in the sitemap what encoding is used when serializing
--- HTML pages: <encoding>UTF-8</encoding></p>
---
--- <pre><map:serializer logger="sitemap.serializer.html" mime-type="text/html"
--- name="html" pool-grow="4" pool-max="32" pool-min="4"
--- src="org.apache.cocoon.serialization.HTMLSerializer">
--- <buffer-size>1024</buffer-size>
--- <encoding>UTF-8</encoding>
--- </map:serializer>
+++ <pre><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</pre>
--- <p>In the example above, UTF-8 is the encoding used. This is a widely supported
--- Unicode encoding, so it is often a good choice.</p>
+++ <p>or from the HTTP Header Content-Type:</p>
--- <p>The HTML serializer will automatically insert a <meta> tag into the
--- HTML page's HEAD element specifying the encoding. Most browsers apparently
--- require this. The HTML serializer will however only do this if your page already
--- contains a HEAD (or head) element, so make sure it has one.
--- The <meta>
--- element inserted by the serializer will then look as follows:</p>
---
--- <pre><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+++ <pre>Content-Type: text/html; charset=UTF-8
</pre>
--- <p>Mozilla (tested with 1.4), netscape 7.1 and Internet Explorer 6 will not
--- respond to the setting of this meta tag, whereas they do respond to the http
--- response header "Content-Type". So you may have to subclass the HTMLSerializer
--- and let it add this header in order to get Mozilla and IE working.<br/>
--- -- <em>Someone added this last paragraph here. Good advice (haven't found time
--- to verify it yet though), but if this is the case we should fix this in Cocoon.
--- Patches welcome in bugzilla.
--- (<a href="http://wiki.apache.org/cocoon/BrunoDumon">BrunoDumon</a>).</em><br/>
--- -- <em>I can confirm it and the effect is obvious when using a recent Tomcat
--- (> 4.1.27):
--- <a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=26997"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- Bug #26997</a>. But AFAIK the above must read 'will not respond to the setting
--- of this meta tag <strong>if</strong> the encoding/charset in the "Content-Type"
--- header is set' and Cocoon's problem is, that it does not set the
--- encoding/charset and the recent Tomcats sets it to default ISO-8859-1.
--- (<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)</em>
--- <br/>
--- -- <em>But you can make Cocoon set the header by configuring the serializer with
--- the correct mime-type information: </em></p>
+++ <p>One has to include both to support all browsers.
+++ This will be done by the
+++ HTML serializer if you configure it with the parameters mime-type and encoding,
+++ as stated above.</p>
--- <ul>
--- <li>
--- <pre><map:serializer name="html" mime-type="text/html; charset=utf-8"
--- src="org.apache.cocoon.serialization.HTMLSerializer"
--- logger="sitemap.serializer.html"
--- pool-grow="4" pool-max="32" pool-min="4">
--- <buffer-size>1024</buffer-size>
--- <encoding>UTF-8</encoding>
--- </map:serializer></pre>
--- </li>
--- </ul>
+++ <h5>Sending form data</h5>
--- <p>The first <tt>charset=utf-8</tt> is needed for the HTTP header whereas
--- <tt><encoding>UTF-8</encoding></tt> seems to be responsible for the
--- encoding only of the document's content. (Volkmar W. Pogatzki)</p>
---
<p>By default, if the browser doesn't explicitly mention the encoding, a servlet container will decode request parameters using the ISO-8859-1 encoding (independent of the platform on which the container is running). So in the above
--- case where UTF-8 was used when serializing, we would be facing problems.</p>
---
--- <p><em>Note: Jetty uses
--- [<a href="http://docs.codehaus.org/display/JETTY/International+Characters+and+Character+Encodings"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- UTF-8 as default for decoding form parameters</a>]! So you have to use the
--- <tt>SetCharacterEncodingFilter</tt> (see below) to set the encoding for Jetty to
--- ISO-8859-1 if this is what the browser sends.</em>
--- --<a href="http://wiki.apache.org/cocoon/AlexanderKlimetschek">AlexanderKlimetschek</a>
+++ case where UTF-8 was used when serializing, we would be facing problems. An
+++ exception, that might hide the problem and which you will face when you use the
+++ handy mvn jetty:run to run your Cocoon application, is that Jetty uses UTF-8 by
+++ default. It does not adhere to the servlet container standard here. So you can
+++ configure your container with the default encoding you want (e.g.
+++ UTF-8), if
+++ that is possible, or you must use a solution like the
+++ <a href="http://wiki.apache.org/cocoon/SetCharacterEncodingFilter">SetCharacterEncodingFilter</a>.
</p>
--- <p>The encoding to use when decoding request parameters can be configured in the
--- web.xml by supplying init parameters called "form-encoding" and
--- "container-encoding" to the Cocoon servlet. The container-encoding parameter
--- indicates according to what encoding the container tried to decode the request
--- parameters (normally ISO-8859-1), and the form-encoding parameter indicates the
--- actual encoding. Here's an example of how to specify the parameters in the
--- web.xml:</p>
+++ <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Request parameter
+++ encoding in Cocoon</h4>
--- <pre><init-param>
--- <param-name>container-encoding</param-name>
--- <param-value>ISO-8859-1</param-value>
--- </init-param>
--- <init-param>
--- <param-name>form-encoding</param-name>
--- <param-value>UTF-8</param-value>
--- </init-param>
--- </pre>
---
<p>For Java-insiders: what Cocoon actually does internally is apply the following trick to get a parameter correctly decoded: suppose "value" is a string containing a request parameter, then Cocoon will do:</p>
(151 equal lines skipped)
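
Note on the skipped lines: the code that follows the "For Java-insiders" paragraph is not part of this diff. The sketch below only illustrates the general re-decoding technique that paragraph describes, assuming the container decoded the parameter as ISO-8859-1 while the browser actually sent UTF-8. The class name RequestParamRedecode and the method redecode are hypothetical and are not taken from the Cocoon sources.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Cocoon's actual class: illustrates the re-decoding idea only.
final class RequestParamRedecode {

    // Turn the wrongly decoded string back into the raw bytes the browser sent,
    // then decode those bytes again with the encoding the form really used.
    static String redecode(String value, Charset containerEncoding, Charset formEncoding) {
        byte[] raw = value.getBytes(containerEncoding);
        return new String(raw, formEncoding);
    }

    public static void main(String[] args) {
        // "Gr\u00fc\u00dfe" contains two non-ASCII characters.
        String original = "Gr\u00fc\u00dfe";

        // Simulate a container that decodes UTF-8 bytes as ISO-8859-1.
        String misdecoded = new String(original.getBytes(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);

        String fixed = redecode(misdecoded, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(fixed.equals(original));
    }
}
```

The round trip is only lossless because ISO-8859-1 maps every byte value to exactly one character; with a different container encoding the original bytes may not be recoverable.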