On 20/05/2010, at 18:41, Sjur Moshagen wrote:
> On 20 May 2010, at 15:26, Thorsten Scherler wrote:
>
>> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>>> ...
>>>> Hmm, that is weird. Please try the following:
>>>> - add a new contract that uses ñ, í and similar characters
>>>> - see what comes out
>>>
>>> I added a blank contract that just printed the same line of characters I
>>> used earlier for testing, and this is what came out:
>>>
>>> This is a text containing problematic characters:
>>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>>>
>>> That is, the text from the contract comes through just fine, but text
>>> coming from a standard Forrest v2 document gets garbled.
>>>
>>> I have attached a picture of the page as it renders. The box comes from the
>>> document, the text at the bottom is from the contract.
>>
>> Ok I see.
>>
>> Please post the dataUri you use for the contract. It seems that the utf-8 is
>> lost in this step. If you have the dataUri of the contract, see what is
>> coming out there, whether it is already scrambled or not.
>
> I'm not sure about how to do this, but I'll try. The dataUri used in the
> structurer is:
>
> <forrest:contract name="content-main"
> dataURI="cocoon://#{$getRequest}.body.xml"> <-- this is the
> dataURI
> <forrest:property name="content-main-conf">
> <headings type="boxed"/>
> </forrest:property>
> </forrest:contract>
>
> which I take to mean:
>
> http://localhost:8888/index.body.xml
Correct, that was the URI I needed.
>
> The text returned by that Uri is:
>
> <?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun -
> Sámi proofing tools project</h1><div id="content-main">
>
> <div class="note"><div class="label">UTF-8 character test</div><div
> class="content">
> There seems to be problems with certain characters, but only in
> Dispatcher:<br xmlns:xi="http://www.w3.org/2001/XInclude"/>
> a á c č d đ n ŋ s š t ŧ z ž ae æ
> oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
> </div></div>
>
> </div></div>
>
> Two things to note here:
>
> The encoding is specified as ISO-8859-1, which is wrong,
Yes, it should be UTF-8.
> and which leads to all characters outside Latin1 being encoded as numeric
> entities.
Actually, the numeric form is fine, or at least it should be. In my use case I
take RSS from Roller, and the characters arrive as numeric entities but with
UTF-8 encoding.
> In the next step, this causes all non-ASCII, non-Latin1 characters to survive
> correctly, while the Latin1 chars will be messed up when they are
> reinterpreted as UTF-8 later - or something along these lines.
Yeah, it seems the numeric form works fine but the "native" form does not play
nice. I wonder whether changing the encoding of the returned *.body.xml
document fixes that problem.
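
The effect can be reproduced outside Forrest. What follows is only a minimal
Python sketch, assuming that a later step decodes the ISO-8859-1 bytes as
UTF-8; it is not what Dispatcher does internally, and the sample string is
shortened for illustration.

```python
import html

sample = "a á c č"

# Serialise as ISO-8859-1: characters outside Latin-1 (here "č") cannot be
# encoded and are written as numeric character references instead.
latin1_bytes = sample.encode("iso-8859-1", errors="xmlcharrefreplace")

# Assumption: a later step wrongly decodes those bytes as UTF-8.
# The lone Latin-1 byte for "á" is not valid UTF-8 and gets replaced.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Resolve the numeric references the way a parser would.
rendered = html.unescape(misread)
print(rendered)
```

The numeric reference is plain ASCII, so it survives the wrong decoding step,
while the raw Latin-1 byte does not. That matches what we see on the page.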
>
> I don't know where the encoding comes from - everything on my end is marked
> as UTF-8. I grepped for the string "ISO-8859-1" in the Forrest sources, and
> got many hits, but nothing that seemed to relate to Dispatcher.
The *.body.xml comes from the dataModel.xmap:
<!-- HTML rendered from intermediate format -->
<map:match pattern="**.body.xml">
<map:generate src="cocoon:/{1}.source.rewritten.xml" />
<map:transform src="{lm:dataModel-html-document-to-html.xsl}">
<map:parameter name="path" value="{1}.html" />
</map:transform>
<map:serialize />
</map:match>
The serializer here is the default one, which we define in the xmap as
<map:serializers default="xml" />
That should read:
<map:serializers default="xml-utf8" />
I committed this change in revision 946939; please see whether that fixes the
issue. I added a test note to
org.apache.forrest.plugin.internal.dispatcher/src/documentation/content/xdocs/index.xml
so you can directly run "forrest run" in the plugin and see the outcome.
Once we are done testing, we should remove the debug note.
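
To check the declared encoding of the returned document without reading it by
eye, something like the following could be used. It is a rough Python sketch:
declared_encoding is a hypothetical helper name, and the sample bytes are the
start of the output posted earlier in this thread.

```python
import re
from typing import Optional


def declared_encoding(xml_bytes: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        xml_bytes,
    )
    return match.group(1).decode("ascii") if match else None


# Start of the *.body.xml output as posted before the change.
before = b'<?xml version="1.0" encoding="ISO-8859-1"?><div id="content">'
print(declared_encoding(before))
```

Against a running instance, the bytes could be fetched with
urllib.request.urlopen("http://localhost:8888/index.body.xml").read() and
passed to the same function.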
salu2
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>