On 20 May 2010, at 15:26, Thorsten Scherler wrote:
> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>> ...
>>> Hmm, that is weird. Please try the following:
>>> - add a new contract that uses ñ, í and similar characters
>>> - see what comes out
>>
>> I added a blank contract that just printed the same line of characters I
>> used earlier for testing, and this is what came out:
>>
>> This is a text containing problematic characters:
>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>>
>> That is, the text from the contract comes through just fine, but text coming
>> from a standard Forrest v2 document gets garbled.
>>
>> I have attached a picture of the page as it renders. The box comes from the
>> document, the text at the bottom is from the contract.
>
> Ok I see.
>
> Please post the dataUri you use for the contract. It seems that the utf-8 is
> lost in this step. If you have the dataUrl of the contract see what is coming
> out there, whether it is already scrambled or not.
I'm not sure how to do this, but I'll try. The dataURI used in the
structurer is:
<forrest:contract name="content-main"
    dataURI="cocoon://#{$getRequest}.body.xml">  <!-- this is the dataURI -->
  <forrest:property name="content-main-conf">
    <headings type="boxed"/>
  </forrest:property>
</forrest:contract>
which I take to mean:
http://localhost:8888/index.body.xml
The text returned by that URI is:
<?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun - Sámi
proofing tools project</h1><div id="content-main">
<div class="note"><div class="label">UTF-8 character test</div><div
class="content">
There seems to be problems with certain characters, but only in
Dispatcher:<br xmlns:xi="http://www.w3.org/2001/XInclude"/>
a á c č d đ n ŋ s š t ŧ z ž ae æ
oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
</div></div>
</div></div>
Two things to note here:
The encoding is declared as ISO-8859-1, which is wrong, and which causes all
characters outside Latin-1 to be written as numeric entities. In the next step,
this lets all non-ASCII, non-Latin-1 characters survive correctly, while the
Latin-1 characters get messed up when their bytes are later reinterpreted as
UTF-8 - or something along those lines.
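
The effect I suspect can be reproduced in isolation. This is only a rough Python sketch of the assumed failure mode, not Forrest or Cocoon code; the sample string and the two-step "serialize as Latin-1, then read as UTF-8" pipeline are my assumptions:

```python
# Sketch of the suspected failure mode (assumed, not taken from Forrest).
text = "á č"  # á is in Latin-1, č is not

# Step 1: serialize as ISO-8859-1. č cannot be represented, so it is
# written as a numeric character reference; á becomes the single byte 0xE1.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Step 2: a later stage wrongly reads those bytes as UTF-8. The lone 0xE1
# byte is not valid UTF-8, so it is replaced by U+FFFD, while the ASCII-only
# character reference passes through untouched.
garbled = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)
print(garbled)
```

If this is what happens, it would explain why č, đ, ŋ and the like come through while á, æ, ø and å are garbled.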
I don't know where the encoding comes from - everything on my end is marked as
UTF-8. I grepped for the string "ISO-8859-1" in the Forrest sources, and got
many hits, but nothing that seemed to relate to Dispatcher.
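
To narrow down which stage introduces the wrong declaration, one could compare the declared encoding with the actual bytes at each step. A minimal sketch, where the helper names `declared_encoding` and `is_valid_utf8` are hypothetical and the sample payload is made up for illustration:

```python
import re


def declared_encoding(payload: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = re.match(
        rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', payload
    )
    return match.group(1).decode("ascii") if match else None


def is_valid_utf8(payload: bytes) -> bool:
    """Report whether the raw bytes decode cleanly as UTF-8."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Made-up payload: declared and serialized as ISO-8859-1.
sample = '<?xml version="1.0" encoding="ISO-8859-1"?><h1>Sámi</h1>'.encode(
    "iso-8859-1"
)

print(declared_encoding(sample))
print(is_valid_utf8(sample))
```

Running this on the raw bytes saved from each intermediate URI should show where the declaration and the bytes stop agreeing.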
Sjur