On Sat, Feb 7, 2009 at 9:25 AM, John Doe <[email protected]> wrote:

>
> I am having difficulty while parsing some Turkish sites. Here is the
> part of the code. The problem is when the title contains some non-UTF
> characters like ç,ü,ı,ö,ğ it stops parsing and doesn't read the rest.
> For example if the title is "Ebru Gündeş askere gitti" it only reads
> until "ş" which is "Ebru G". Or when reading "Serdar Ortaç sünnet
> oldu" it only reads "Serdar Or"
> How can I fix the problem? Any suggestions???


Well, what you probably mean is "non-ASCII".  Those are all perfectly
satisfactory Unicode characters.  What you need to do is figure out what
"encoding" your text is in.  At the top of the XML there should be a thing
that looks like

<?xml version="1.0" encoding="XXX" ?>
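
To see the declaration doing its job, here is a minimal Java sketch.  It
assumes the feed is really in ISO-8859-9 (a guess, not something the
original post confirms) and uses the standard javax.xml.parsers DOM API.
The class and method names (DeclaredEncodingDemo, parseTitle) are made up
for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not taken from the original poster's code.
final class DeclaredEncodingDemo {

    // Encodes the XML text with the given charset, then lets the parser
    // decode the raw bytes using the encoding named in the XML declaration.
    static String parseTitle(String xml, String charsetName) {
        try {
            byte[] bytes = xml.getBytes(Charset.forName(charsetName));
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is ü and \u015f is ş; escapes keep this source file pure ASCII.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-9\"?>"
                + "<title>Ebru G\u00fcnde\u015f askere gitti</title>";
        System.out.println(parseTitle(xml, "ISO-8859-9"));
    }
}
```

The point is to hand the parser bytes (an InputStream), not a String you
decoded yourself, so that it can act on the declaration.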

If the encoding of your file is UTF-8 (the default) you don't need the
encoding= bit.  So one thing would be to convert your file to UTF-8; there
are a bunch of encoding conversion tools around.  Another would be to figure
out the encoding, quite possibly ISO-8859-9 (Latin-5; ISO-8859-1 lacks ğ, ı,
and ş), and put in "ISO-8859-9" for "XXX" above.  Another would be to convert the non-ASCII
characters to what are called "numeric character references".  For
example, ç is U+00E7 in Unicode, so you can include it as &#xe7; in your XML
and everything will be fine.
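
The numeric character reference option can be sketched in Java roughly as
follows.  This is an illustration, not a library API: the class and method
names (NumericCharRefs, escapeNonAscii) are invented, and it assumes the
text has already been decoded correctly into a Java String.

```java
// Hypothetical helper; the names are illustrative only.
final class NumericCharRefs {

    // Replaces every non-ASCII code point with a hexadecimal numeric
    // character reference such as &#xe7; and leaves ASCII untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is ç and \u00fc is ü.
        System.out.println(escapeNonAscii("Serdar Orta\u00e7 s\u00fcnnet oldu"));
        // prints: Serdar Orta&#xe7; s&#xfc;nnet oldu
    }
}
```

The result is pure ASCII, so it reads the same whichever encoding the
parser ends up assuming.  It does not escape &, < or >, which still need
the usual XML escaping.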

The real lesson is that internationalization is hard.  I've written a couple
of things on this that are widely read and might be useful.

http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

 -Tim

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google
Groups "Android Developers" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/android-developers?hl=en
-~----------~----~----~----~------~----~------~--~---
