At 04:51 -0800 12/03/2011, eleven wrote:

I would still like to know if it is even possible to do transliteration chores like this within BBEdit through a text factory or similar, as well as any wisdom as to how to go about it...

Your problem here is that it is not simple transliteration.

You can easily convert the decimal HTML entities into characters using a UNIX filter, something like this:

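#!/usr/bin/perl
use strict;
use warnings;
# Turn each three-digit decimal entity (e.g. &#209;) back into a single raw byte;
# values above 255 cannot be a byte and are left untouched.
while (<>) {
  s~&#(\d{3});~$1 < 256 ? chr($1) : "&#$1;"~eg;
  print;
}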

but that will get you nowhere unless you know, or can guess, what transformations the original Cyrillic was subjected to in the process of producing the garbage. It is likely that at some point an attempt was made to convert something to UTF-8, and that the raw bytes of the supposed UTF-8 were then converted to decimal HTML entities wherever they fell outside the range of ISO-8859-1; you will see that characters within that range have not been so encoded.

The original Cyrillic could have been in any one of four distinct encodings, which makes the task even more difficult. The same problem arises quite frequently on badly managed European sites, but there the chances are that the original text was windows-1252, so it is easier to follow the process back.
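To make "following the process back" concrete, here is a rough sketch of how one guess could be tested once the entities have been turned back into bytes. It is only an illustration under stated assumptions, not a fix for the actual file: the helper name undo_double_encoding is made up, the intermediate misreading as cp1252 is a guess, and the list of candidate Cyrillic encodings is my own, since nothing above says which four are in play. Only the standard Encode module is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper. Assumes $bytes is UTF-8 that was produced by misreading
# $legacy-encoded text as $misread. Returns the recovered string, or undef if
# any step of the reversal fails.
sub undo_double_encoding {
    my ($bytes, $misread, $legacy) = @_;
    my $result = eval {
        # 1. The bytes must be valid UTF-8, otherwise the guess is wrong.
        my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
        # 2. Map the characters back to the single bytes they were misread from.
        my $raw = encode($misread, $chars, Encode::FB_CROAK);
        # 3. Read those bytes in the encoding we suspect the original was in.
        decode($legacy, $raw, Encode::FB_CROAK);
    };
    return $result;
}

# Demo on a made-up sample: cp1251 text misread as cp1252, then stored as UTF-8.
my $sample  = "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $garbage = encode('UTF-8', decode('cp1252', encode('cp1251', $sample)));

binmode STDOUT, ':encoding(UTF-8)';
# Candidate list is an assumption; adjust it to whatever encodings you suspect.
for my $legacy (qw(cp1251 koi8-r iso-8859-5 MacCyrillic)) {
    my $guess = undo_double_encoding($garbage, 'cp1252', $legacy);
    print "$legacy: ", defined $guess ? $guess : '(failed)', "\n";
}
```

Several candidates will usually decode without error, so the script cannot pick the winner for you; you still have to look at the output and see which line reads as Russian.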

JD



