Why do you think the file is correct? 

It sounds like 'vi' (probably vim) did not write a valid UTF-8 file. Maybe it 
was working in an 8-bit character set (i.e., a single-byte, 256-character code 
set such as ISO-8859-1/Latin-1) and wrote a value from the upper 128 characters 
(the part that is not 7-bit ASCII). A UTF-8 decoder would interpret such a byte 
as part of a multi-byte sequence, but the subsequent bytes would not be valid 
continuation bytes.

Cat'ing to the terminal may not work either if the terminal emulator is 
running with an 8-bit character set (e.g., ISO-8859-1) with 256 characters and 
not UTF-8.
Even in 'vi' you probably have to tell it that the file is UTF-8 and not an 
8-bit encoding (in vim, for example, with ':set fileencoding=utf-8').
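
To check this theory, here is a minimal sketch, assuming the file was saved as 
ISO-8859-1. In that encoding ô is the single byte 0xF4, which UTF-8 treats as 
the lead byte of a 4-byte sequence, so the next byte is rejected. That would be 
consistent with the Xerces message quoted below. The class name EncodingCheck 
and the sample word are made up for illustration; only java.nio.charset calls 
available since Java 1.4 are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of Castor or Xerces.
public class EncodingCheck {

    // Returns true only if the bytes are well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // "côte" as a Latin-1 editor would write it: ô is the single byte 0xF4.
        byte[] latin1 = { 'c', (byte) 0xF4, 't', 'e' };
        // The same word in UTF-8: ô is the two bytes 0xC3 0xB4.
        byte[] utf8 = { 'c', (byte) 0xC3, (byte) 0xB4, 't', 'e' };

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true

        // Decoding each byte array with its real encoding gives the same text.
        String a = new String(latin1, "ISO-8859-1");
        String b = new String(utf8, "UTF-8");
        System.out.println(a.equals(b));         // true
    }
}
```

If the check fails on the real file, either re-save it as UTF-8 or declare the 
encoding it actually uses in the XML prolog.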

-----Original Message-----
From: Pander [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 06, 2006 12:07 PM
To: dev@castor.codehaus.org
Subject: [castor-dev] How to handle UTF-8 characters like · and ô

>>> Because I did not get an answer to my post on the user list, I am
reposting this to the dev list <<<

Hi all,

With Castor 1.0M1 and Java 1.4.2 I have the following problem with special 
characters. (Both with Blackdown Java(TM) 2 SDK, Standard Edition, Ubuntu 
Breezy Badger package AND j2sdk from sun for Linux.)

An XML file holds special characters such as · (centered dot) and ô (o with a ^ 
above it). The XML file was created with vi, but when I cat it to my 
terminal these special characters look like empty squares. When I unmarshal the 
XML file and write the string to a file, these special characters are all black 
diamonds with a white question mark inside (both when opening the file with vi 
and when catting it to my terminal).

I have tested the XML file with:
        <?xml version="1.0" encoding="Latin1"?> and
        <?xml version="1.0" encoding="UTF-8"?>
Both give the same result as described above.

However validating the XML with org.apache.xerces.parsers.DOMParser
results in an error for the UTF-8 case:
        [Fatal Error] test.xml:8:47: Invalid byte 2 of 4-byte UTF-8 sequence.
        Exception caught in main: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.

How can I fix this so I can use these special characters?

Thanks,

Pander


-------------------------------------------------
If you wish to unsubscribe from this list, please send an empty message to the 
following address:

[EMAIL PROTECTED]
-------------------------------------------------



