On 30 Sep 2008, at 20:08, Gnana Arasan wrote:
We are inserting the XML (UTF-8) content using session.insertContent(uri, inputstream, options). By default the encoding option is UTF-8 (ML version 3.5-2). For example, the person name josé is stored. In cq, using doc(uri), the content seems to be José.

The thing to do is to check the string-length() of "José".

If it's 4, then it's being stored correctly in MarkLogic. This means that the issue is to do with output: something is interpreting UTF-8 as ISO-8859-1.

If it's 5, then it's being stored incorrectly in MarkLogic: the two UTF-8 bytes of "é" have each been read as a separate character. This means that the input processes you thought were sending in UTF-8 are really interpreting the data as ISO-8859-1. I'd guess from your mail that you're using Java to read the content in. I'd be extremely careful in Java, as it's all too easy to use the "system default encoding" by accident. This is normally Cp1252 on Windows, or MacRoman on a Mac, neither of which is particularly useful.
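
To see where the 4 and the 5 come from, here is a minimal Java sketch. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It encodes the name as UTF-8 and then deliberately decodes those bytes as ISO-8859-1, which is the mistake described above.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then wrongly decode the
    // same bytes as ISO-8859-1, one character per byte.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; the escape keeps this source file pure ASCII.
        String name = "Jos\u00e9";
        System.out.println(name.length());            // 4
        System.out.println(misdecode(name).length()); // 5
    }
}
```

The "é" takes two bytes in UTF-8 (0xC3 0xA9), and ISO-8859-1 maps every byte to one character, so the misread string is one character longer.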

Any time you convert between bytes and characters in Java, you need to specify an encoding. Particular candidates to watch out for include FileReader and the no-argument String.getBytes(), both of which fall back to the default encoding. If you examine the code that's creating that InputStream, you may well find such an example.
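
As a rough sketch of what "specify an encoding" can look like, the following names the charset at every byte/character boundary. The class and helper names are hypothetical, it assumes Java 8 or later, and the stream it builds is only a stand-in for whatever you actually pass to session.insertContent.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Characters -> bytes: name the charset instead of calling getBytes().
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    // Bytes -> characters: wrap the stream in an InputStreamReader with an
    // explicit charset instead of using FileReader or a default-charset reader.
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<person>Jos\u00e9</person>";
        String roundTripped = readUtf8(toUtf8Stream(doc));
        // The round trip preserves the text when both sides agree on UTF-8.
        System.out.println(roundTripped.equals(doc));
    }
}
```

For a file on disk, the equivalent of readUtf8 would wrap a FileInputStream in the same InputStreamReader rather than using FileReader.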

-Dom
