[cross-posted because people on the Cocoon list might hit this as well]
I've always tested Xindice with English documents, so I didn't notice
this behavior until today, when I imported an Italian XML document.
The document is encoded in UTF-8 and looks like this:
<?xml version="1.0" encoding="UTF-8"?>
...
<subtitle>
In sempre più film il computer con la Mela è l'arma
dei giusti contro criminali di ogni specie che invece
preferiscono i pc
</subtitle>
...
[this is a news document taken from an Italian online newspaper]
ù -> 0xC3 0xB9
è -> 0xC3 0xA8
are the two-byte UTF-8 encodings of the non-ASCII characters (since
UTF-8 is backward compatible with ASCII, you don't notice any difference
until you use non-ASCII letters such as these).
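
The UTF-8 bytes of these two letters can be checked with a few lines of
Python. This is only an illustrative sketch: Python is not part of the
Xindice/Cocoon setup described here, and `utf8_hex` is a hypothetical
helper name.

```python
# Illustration only: inspect the UTF-8 bytes of the accented letters.
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of `text` as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

print(utf8_hex("\u00f9"))  # u with grave accent -> C3 B9
print(utf8_hex("\u00e8"))  # e with grave accent -> C3 A8
print(utf8_hex("pc"))      # plain ASCII stays one byte per letter -> 70 63
```
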
Opening the document in Explorer or XML-Spy yields the correct
characters.
When I then import it into the database and access it through the Cocoon
XML:DB source, I get (in the Explorer window):
<?xml version="1.0" encoding="UTF-8" ?>
...
<subtitle>
In sempre più film il computer con la Mela è l'arma dei giusti
contro criminali di ogni specie che invece preferiscono i pc
</subtitle>
The same thing happens when opening the source in the Notepad window.
But Notepad on Win2k is Unicode-aware, so I saved the source to disk and
opened it with UltraEdit (which is also Unicode-aware but has a nice
binary view) and voilà:
...
<subtitle>
In sempre più film il computer con la Mela è
l'arma dei giusti contro criminali di ogni specie
che invece preferiscono i pc
</subtitle>
...
where I believe that
Ã -> 0xC3 0x83
¹ -> 0xC2 0xB9
This similarity in encoding probably shows why nobody noticed this
before.
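
The suspected double encoding can be reproduced in a few lines. This is
a sketch under one assumption that the report does not confirm: that the
first pass misreads the UTF-8 bytes as ISO-8859-1 (Latin-1), one
character per byte, before encoding them to UTF-8 again. `double_encode`
and `hex_bytes` are hypothetical helper names.

```python
# Sketch of the suspected double encoding.
# Assumption: the UTF-8 bytes are first misread as Latin-1 text.
def double_encode(text: str) -> bytes:
    once = text.encode("utf-8")        # correct UTF-8 bytes
    misread = once.decode("latin-1")   # each byte taken as one character
    return misread.encode("utf-8")     # encoded a second time

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex for easy comparison."""
    return " ".join(f"{b:02X}" for b in data)

print(hex_bytes(double_encode("\u00f9")))  # C3 83 C2 B9
print(hex_bytes(double_encode("\u00e8")))  # C3 83 C2 A8
```

Each accented letter grows from two bytes to four, and the result is
still valid UTF-8, so a Unicode-aware viewer shows two odd characters
rather than reporting an error.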
So I went directly into the news.tbl and got the same bytes:
n sempre più film il compu
ter con la Mela è l'arma d
ei giusti
which clearly indicates that the 'xindice' command-line import tool is
somehow ignoring the 'UTF-8' encoding declaration and performing UTF-8
encoding on something that is *already* UTF-8 encoded.
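
Until the import path is fixed, data that was stored this way could in
principle be checked and repaired after the fact. The following is a
minimal sketch, not anything Xindice provides: `undo_double_utf8` is a
hypothetical helper, and it assumes the extra layer came from treating
the bytes as Latin-1.

```python
def undo_double_utf8(data: bytes) -> bytes:
    """Remove one layer of UTF-8-over-Latin-1 encoding, if present.

    Returns `data` unchanged when it does not look double-encoded.
    """
    try:
        # Reverse the second encoding step, then the Latin-1 misreading.
        candidate = data.decode("utf-8").encode("latin-1")
        # The result must itself be valid UTF-8, otherwise the input
        # was most likely encoded correctly in the first place.
        candidate.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return candidate
```

This is only a heuristic: text that legitimately contains a sequence
such as "Ã" followed by "¹" would be altered by it, so it should be
applied only to documents known to have gone through the faulty import.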
My impression is that there is nothing wrong with the way Xindice or
Cocoon get the information *out* of the database: the problem lies in
how the information gets *into* the database.
I would suggest that the Xindice dev community consider this bug a
showstopper for the 1.0 final release.
--
Stefano Mazzocchi One must still have chaos in oneself to be
able to give birth to a dancing star.
<[EMAIL PROTECTED]> Friedrich Nietzsche