At 04:05 PM +0000 on 03/12/2011, John Delacour wrote about Re:
transliterate into cyrillic:
At some point it is likely that an attempt was made to convert
something to utf-8 and the raw bytes of the supposed utf-8 were then
converted to decimal html entities where they were outside the range
of iso-8859-1.
Anything that is in UTF-8 is made up of bytes that are either in the
range x00-7F (US-ASCII, one byte per character) or a lead byte of xC0
or above (xC2-F4 in valid UTF-8) followed by one or more bytes in the
x80-BF range. The number of bytes in a multi-byte sequence is given by
the number of 1 bits at the start of the lead byte before you get to
a 0 bit (thus 110xxxxx is 2 bytes [1 following byte], 1110xxxx is
3 bytes [2 following bytes], etc.). All following bytes are of the
form 10xxxxxx (so if you find one, you look left until you find a
byte of the form 11xxxxxx, which is the start byte). Details are at
http://en.wikipedia.org/wiki/Utf8.
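As a rough illustration of those rules, here is a small Python
sketch. The names seq_length and find_start are made up for this
example and do not come from any library, and the code only looks at
bit patterns, so it does not reject every invalid byte.

```python
def seq_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte's bit pattern,
    or 0 if the byte cannot start a sequence (for example a 10xxxxxx
    continuation byte). Bit patterns only: xC0 and xC1 are reported as
    2-byte leads even though strict UTF-8 never uses them."""
    if lead < 0x80:            # 0xxxxxxx: US-ASCII, one byte
        return 1
    if lead & 0xE0 == 0xC0:    # 110xxxxx: 2 bytes (1 following byte)
        return 2
    if lead & 0xF0 == 0xE0:    # 1110xxxx: 3 bytes (2 following bytes)
        return 3
    if lead & 0xF8 == 0xF0:    # 11110xxx: 4 bytes (3 following bytes)
        return 4
    return 0


def find_start(data: bytes, i: int) -> int:
    """Look left from index i until a byte that is not 10xxxxxx."""
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i


# "a", CYRILLIC CAPITAL ZHE, EURO SIGN -> bytes 61 D0 96 E2 82 AC
sample = "a\u0416\u20ac".encode("utf-8")
print([seq_length(b) for b in sample])   # length implied by each byte
print(find_start(sample, 5))             # start of the sequence ending at index 5
```

Starting from the last byte of the sample (xAC), find_start steps
left over x82 and stops at xE2, the lead byte of the 3-byte sequence.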
As to the mangling issue, the codes do not match something converted
into UTF-8. For real Unicode Cyrillic (like the good sample in the
1000 range), here is the breakdown:
Cyrillic runs from U+0400 to U+052F as Unicode (decimal 1024 to 1327).
This corresponds to the bytes D0 80 to D4 AF (decimal 208 128 to
212 175) when UTF-8 encoded.
The numbers in the mangled text are off for real UTF-8, even if the
two bytes are merged into one.
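If you want to check such numbers yourself, something like the
following Python sketch should do. It prints the UTF-8 bytes for the
two ends of the Cyrillic range and tests whether a list of decimal
values, read as raw bytes, decodes to Cyrillic. The helper name
could_be_cyrillic_utf8 is invented for this example, and the values
passed to it below are illustrative, not the ones from the mangled
text.

```python
first, last = "\u0400", "\u052F"   # the two ends of the Cyrillic range

for ch in (first, last):
    raw = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    # Show the code point, its UTF-8 bytes in hex, and the same bytes in decimal
    print(f"U+{ord(ch):04X}", hex_bytes, list(raw))


def could_be_cyrillic_utf8(values):
    """True if the numbers, taken as raw bytes, are valid UTF-8 that
    decodes entirely to characters in U+0400..U+052F."""
    try:
        # bytes() raises ValueError for numbers outside 0-255;
        # UnicodeDecodeError is a subclass of ValueError.
        text = bytes(values).decode("utf-8")
    except ValueError:
        return False
    return bool(text) and all(0x0400 <= ord(c) <= 0x052F for c in text)


print(could_be_cyrillic_utf8([208, 128]))   # bytes D0 80
print(could_be_cyrillic_utf8([195, 144]))   # bytes C3 90, which is U+00D0
```

A pair that fails this check is either not UTF-8 at all or is UTF-8
for something outside the Cyrillic blocks.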