Re: Latin1 character problems in dispatcher

Thorsten Scherler Fri, 21 May 2010 02:03:53 -0700

On 20/05/2010, at 18:41, Sjur Moshagen wrote:

> Den 20. mai. 2010 kl. 15.26 skrev Thorsten Scherler:
> 
>> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>>> ...
>>>> Hmm, that is weird. Please try the following:
>>>> - add a new contract that uses ñ, í and similar characters
>>>> - see what comes out
>>> 
>>> I added a blank contract that just printed the same line of characters I 
>>> used earlier for testing, and this is what came out:
>>> 
>>> This is a text containing problematic characters:
>>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>>> 
>>> That is, the text from the contract comes through just fine, but text 
>>> coming from a standard Forrest v2 document gets garbled.
>>> 
>>> I have attached a picture of the page as it renders. The box comes from the 
>>> document, the text at the bottom is from the contract.
>> 
>> Ok I see. 
>> 
>> Please post the dataUri you use for the contract. It seems that the utf-8 is 
>> lost in this step. If you have the dataUrl of the contract see what is 
>> coming out there, whether it is already scrambled or not.
> 
> I'm not sure about how to do this, but I'll try. The dataUri used in the 
> structurer is:
> 
>          <forrest:contract name="content-main" 
>            dataURI="cocoon://#{$getRequest}.body.xml">   <-- this is the dataURI
>            <forrest:property name="content-main-conf">
>              <headings type="boxed"/>
>            </forrest:property>
>          </forrest:contract>
> 
> which I take to mean:
> 
> http://localhost:8888/index.body.xml


Correct, that was the URI I needed.

> 
> The text returned by that Uri is:
> 
> <?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun - 
> Sámi proofing tools project</h1><div id="content-main">
> 
>         <div class="note"><div class="label">UTF-8 character test</div><div 
> class="content">
>               There seems to be problems with certain characters, but only in
>               Dispatcher:<br xmlns:xi="http://www.w3.org/2001/XInclude"/>
>               a á c &#269; d &#273; n &#331; s &#353; t &#359; z &#382; ae æ 
> oe ø ao å a¨ ä o¨ ö g &#485; h &#295; u &#649; i &#616;
>         </div></div>
> 
>  </div></div>
> 
> Two things to note here:
> 
> The encoding is specified as ISO-8859-1, which is wrong,

Yes, it should be UTF-8.

> and which leads to all characters outside Latin1 to be encoded as numeric 
> entities.

Actually, the numeric form is fine, or at least should be. In my use case I take 
RSS from Roller, and the characters arrive as numeric entities but with UTF-8 
encoding.

> In the next step, this causes all non-ASCII, non-Latin1 characters to survive 
> correctly, while the Latin1 chars will be messed up when they are 
> reinterpreted as UTF-8 later - or something along these lines.

Yeah, it seems the numeric form works fine but the "native" form does not 
play nice. I wonder whether changing the encoding of the returned *.body.xml 
document fixes that problem.
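
The suspected mechanism can be reproduced outside Forrest. The following is a 
minimal Python sketch, not the Dispatcher code: it assumes that one step 
serializes as ISO-8859-1 and a later step reads the bytes as UTF-8. The helper 
names serialize_latin1 and read_as_utf8 are hypothetical, and the sample string 
is a shortened version of the test line above.

```python
# Sketch of the suspected failure mode (assumption: the bytes are written
# as ISO-8859-1 and later read as UTF-8). Helper names are hypothetical.

text = "a á c č ae æ"


def serialize_latin1(s: str) -> bytes:
    # Mimic a serializer configured for ISO-8859-1: characters outside
    # Latin-1 become numeric character references, the rest single bytes.
    return s.encode("iso-8859-1", errors="xmlcharrefreplace")


def read_as_utf8(data: bytes) -> str:
    # Mimic a later step that wrongly assumes the bytes are UTF-8.
    # Invalid sequences are replaced with U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


latin1_bytes = serialize_latin1(text)
garbled = read_as_utf8(latin1_bytes)

# Round trip with a consistent encoding, for comparison.
clean = text.encode("utf-8").decode("utf-8")

print(latin1_bytes)
print(garbled)
print(clean)
```

In this sketch the numeric reference for č is pure ASCII and survives either 
way, while the single Latin-1 bytes for á and æ are not valid UTF-8 and get 
damaged. That matches the pattern seen in the rendered page.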

> 
> I don't know where the encoding comes from - everything on my end is marked 
> as UTF-8. I grepped for the string "ISO-8859-1" in the Forrest sources, and 
> got many hits, but nothing that seemed to relate to Dispatcher.

The *.body.xml comes from the dataModel.xmap:

<!-- HTML rendered from intermediate format -->
      <map:match pattern="**.body.xml">
        <map:generate src="cocoon:/{1}.source.rewritten.xml" />
        <map:transform src="{lm:dataModel-html-document-to-html.xsl}">
          <map:parameter name="path" value="{1}.html" />
        </map:transform>
        <map:serialize />
      </map:match>

The serializer here is the default one.

We define it in the xmap as:

<map:serializers default="xml" />

That should read:
<map:serializers default="xml-utf8" />

I committed this change in revision 946939; please check whether it fixes the 
issue. I added a test note to 
org.apache.forrest.plugin.internal.dispatcher/src/documentation/content/xdocs/index.xml
so you can run "forrest run" directly in the plugin and see the outcome.
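
One way to check the outcome without relying on the browser is to look at the 
XML declaration of the *.body.xml response. The following is a rough Python 
sketch under stated assumptions: declared_encoding and looks_fixed are 
hypothetical helpers, and the sample byte strings are shortened stand-ins for 
the real response, not actual Forrest output.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
XML_DECL = re.compile(rb'^<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data: bytes) -> str:
    # Return the encoding named in the XML declaration; without one,
    # fall back to UTF-8 (ignoring the BOM-based UTF-16 case).
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else "UTF-8"


def looks_fixed(data: bytes) -> bool:
    # The fix is considered effective when the document declares UTF-8
    # and its bytes really are valid UTF-8.
    if declared_encoding(data).lower() not in ("utf-8", "utf8"):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Shortened stand-ins for the response before and after the change.
before = b'<?xml version="1.0" encoding="ISO-8859-1"?><div id="content"/>'
after = '<?xml version="1.0" encoding="UTF-8"?><div>á č</div>'.encode("utf-8")

print(declared_encoding(before), looks_fixed(before))
print(declared_encoding(after), looks_fixed(after))
```

With "forrest run" active, the bytes to check could be fetched with 
urllib.request.urlopen("http://localhost:8888/index.body.xml").read() and 
passed to looks_fixed.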

Once we are done testing, we should remove the debug note.

salu2

Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
