Re: transliterate into cyrillic

Robert A. Rosenberg Sun, 13 Mar 2011 22:38:47 -0700

At 04:05 PM +0000 on 03/12/2011, John Delacour wrote about Re: transliterate into cyrillic:

At some point it is likely that an attempt was made to convert something to utf-8 and the raw bytes of the supposed utf-8 were then converted to decimal html entities where they were outside the range of iso-8859-1

Anything that is in UTF-8 has each character encoded either as a single byte in the range x00-7F (if it is US-ASCII) or as a start byte of xC0 or above followed by one or more continuation bytes in the x80-BF range (strictly, start bytes run from xC2 to xF4, since xC0, xC1, and xF5 and above never occur in valid UTF-8). The number of bytes in the sequence is based on the number of 1 bits at the start of the first byte before you get to a 0 bit (thus 110xxxxx is 2 bytes [1 following byte], 1110xxxx is 3 bytes [2 following bytes], etc.). All following bytes are of the form 10xxxxxx (so if you find one, you look left until you find one that is of the form 11xxxxxx, which is a start byte). Details are at http://en.wikipedia.org/wiki/Utf8.
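As a rough illustration of the rule above, here is a minimal Python sketch. The function names utf8_sequence_length and find_sequence_start are made up for this example, and the code only inspects bit patterns; it does not validate that the bytes are well-formed UTF-8.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total number of bytes in the sequence begun by lead_byte.

    Hypothetical helper: it only counts leading 1 bits and does not reject
    start bytes that never occur in valid UTF-8 (xC0, xC1, xF5 and above).
    """
    if lead_byte < 0x80:
        # 0xxxxxxx: US-ASCII, a single byte on its own
        return 1
    if lead_byte < 0xC0:
        # 10xxxxxx: a following byte, so it cannot start a sequence
        raise ValueError("continuation byte, not a start byte")
    # Count the 1 bits at the start of the byte until the first 0 bit
    length = 0
    mask = 0x80
    while lead_byte & mask:
        length += 1
        mask >>= 1
    return length


def find_sequence_start(data: bytes, index: int) -> int:
    """Look left from index until the byte is no longer of the form 10xxxxxx."""
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


sample = "a\u0436".encode("utf-8")       # bytes x61 xD0 xB6
print(utf8_sequence_length(sample[1]))   # 2 (xD0 is 110xxxxx)
print(find_sequence_start(sample, 2))    # 1 (xB6 is 10xxxxxx, start is xD0)
```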

As to the mangling issue, the codes do not match something converted into UTF-8. For real Unicode Cyrillic (like the good sample in the 1000 range), here is the breakdown:


Cyrillic is from &#x400; to &#x52F; as Unicode. This corresponds to
&#xD0;&#x80; to &#xD4;&#xAF; when UTF-8 encoded.

The numbers are off for real UTF-8, even if the two bytes are merged into one.
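The byte values for the Cyrillic range can be checked with a short Python sketch. The helper names as_hex_entities and is_cyrillic_utf8_pair are made up for this example; the second one is just one possible way to test whether a pair of entity values could be the UTF-8 encoding of a character in that range.

```python
def as_hex_entities(text: str) -> str:
    """Write each UTF-8 byte of text as a hex character entity."""
    return "".join(f"&#x{b:02X};" for b in text.encode("utf-8"))


def is_cyrillic_utf8_pair(b1: int, b2: int) -> bool:
    """True if the two byte values (each 0-255) decode as UTF-8 to a single
    character in the range x400 to x52F."""
    try:
        ch = bytes([b1, b2]).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(ch) == 1 and 0x400 <= ord(ch) <= 0x52F


print(as_hex_entities("\u0400"), "to", as_hex_entities("\u052F"))
# &#xD0;&#x80; to &#xD4;&#xAF;

print(is_cyrillic_utf8_pair(0xD0, 0x96))  # True: xD0 x96 is U+0416
print(is_cyrillic_utf8_pair(0x41, 0x42))  # False: two US-ASCII bytes
```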

