At 04:05 PM +0000 on 03/12/2011, John Delacour wrote about Re: transliterate into cyrillic:

At some point it is likely that an attempt was made to convert something to utf-8 and the raw bytes of the supposed utf-8 were then converted to decimal html entities where they were outside the range of iso-8859-1.

Anything that is in UTF-8 is made up of bytes that are either in the range x00-7F (US-ASCII, one byte per character) or a start byte of xC0 or above followed by one or more continuation bytes in the x80-BF range. (Strictly, valid start bytes run from xC2 to xF4.) The number of bytes in a multi-byte sequence is given by the number of 1 bits at the start of the first byte before you get to a 0 bit: 110xxxxx is 2 bytes [1 following byte], 1110xxxx is 3 bytes [2 following bytes], etc. All following bytes are of the form 10xxxxxx, so if you land on one, you look left until you find a byte of the form 11xxxxxx, which is the start byte. Details are at http://en.wikipedia.org/wiki/Utf8.
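
The byte rules above can be sketched in a few lines of Python. This is only an illustration: the function names sequence_length and find_start are made up for this example, and the check looks at the bit pattern alone, so it does not reject overlong or out-of-range start bytes.

```python
def sequence_length(lead: int) -> int:
    """Return the total byte count announced by a UTF-8 start byte."""
    if lead < 0x80:               # 0xxxxxxx: US-ASCII, one byte
        return 1
    if (lead >> 5) == 0b110:      # 110xxxxx: two bytes
        return 2
    if (lead >> 4) == 0b1110:     # 1110xxxx: three bytes
        return 3
    if (lead >> 3) == 0b11110:    # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never occurs in UTF-8
    raise ValueError(f"x{lead:02X} is not a start byte")


def find_start(data: bytes, i: int) -> int:
    """Look left from index i, past 10xxxxxx bytes, to the start byte."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

For example, "aЀb" encodes to the four bytes x61 xD0 x80 x62, so find_start on index 2 (the x80 byte) steps back to index 1, and sequence_length(0xD0) reports a two-byte sequence.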

As to the mangling issue, the codes do not match something converted into UTF-8. For real Unicode Cyrillic (like the good sample in the 1000 range), here is the breakdown:

Cyrillic is from U+0400 (Ѐ) to U+052F (ԯ) as Unicode, which is 1024 to 1327 in decimal. This corresponds to
the byte pairs xD0 x80 to xD4 xAF when UTF-8 encoded.
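
To check that breakdown, here is a small Python sketch that encodes a code point and prints its UTF-8 bytes, both as hex and as the per-byte decimal entities you would get if each raw byte were converted separately. The helper name utf8_breakdown is hypothetical, and the per-byte entity form is only my reading of the mangling described in the quoted message.

```python
def utf8_breakdown(cp: int) -> tuple[str, str]:
    """Return the UTF-8 bytes of a code point as hex and as per-byte decimal entities."""
    raw = chr(cp).encode("utf-8")
    hex_bytes = " ".join(f"x{b:02X}" for b in raw)
    entities = "".join(f"&#{b};" for b in raw)
    return hex_bytes, entities


# First and last code points of the Cyrillic range discussed above
for cp in (0x0400, 0x052F):
    hex_bytes, entities = utf8_breakdown(cp)
    print(f"U+{cp:04X} (&#{cp};) -> {hex_bytes} -> {entities}")
```

So a correctly encoded Cyrillic character that had its bytes turned into entities one at a time would show up as a pair whose first number is between 208 and 212 and whose second is between 128 and 191.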

The numbers are off for real UTF-8, even if the two bytes are merged into one.
