At 04:51 -0800 12/03/2011, eleven wrote:
I would still like to know if it is even possible to do
transliteration chores like this within BBEdit thru a text factory
or similar, as well as any wisdom as to how to go about it...
Your problem here is that it is not simple transliteration.
You can convert the decimal html entities easily into characters
using a UNIX filter something like this:
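#!/usr/bin/perl
use strict;
no warnings;  # silences "Wide character in print" for code points above 255
while (<>) {
    # Replace every decimal entity, whatever its length (&#376; or &#8364;),
    # with the character it names.
    s~&#(\d+);~chr($1)~eg;
    print;
}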
but that will get you nowhere unless you know, or can guess, what
transformations the original Cyrillic was subjected to in the process
of producing the garbage. At some point it is likely that an attempt
was made to convert something to utf-8 and the raw bytes of the
supposed utf-8 were then converted to decimal html entities where
they were outside the range of iso-8859-1 -- you will see that
characters within range have not been so encoded. The original
Cyrillic could have been in any one of four distinct encodings. This
makes the task even more difficult. The problem arises quite
frequently in badly managed European sites but here the chances are
that the original text was windows-1252 and it's easier to follow the
process back.
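To make "following the process back" concrete, here is a rough sketch of undoing one guessed chain. It assumes the Cyrillic was correct utf-8 at some point, that its bytes were then read as windows-1252 characters, and that any resulting character outside iso-8859-1 was written as a decimal entity. All of that is a guess, and the name undo_guess is made up for the illustration; if the text went through a different chain this will only produce different garbage.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: reverse ONE assumed chain of conversions
# (utf-8 bytes -> read as windows-1252 -> non-Latin-1 chars to entities).
sub undo_guess {
    my ($text) = @_;
    # 1. Turn the decimal entities back into characters.
    $text =~ s/&#(\d+);/chr($1)/eg;
    # 2. Map each character back to the single byte it stood for
    #    in windows-1252 (the assumed intermediate encoding).
    my $bytes = encode('cp1252', $text);
    # 3. Read those bytes as the utf-8 they presumably were.
    return decode('UTF-8', $bytes);
}

# Sample garbage written with escapes so the script itself stays ASCII.
my $garbage = "\x{D0}&#376;\x{D1}&#8364;\x{D0}\x{B8}"
            . "\x{D0}\x{B2}\x{D0}\x{B5}\x{D1}&#8218;";

binmode(STDOUT, ':encoding(UTF-8)');
print undo_guess($garbage), "\n";
```

The five byte values that windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) may not survive step 2, so some letters can still come out wrong even when the guess is right.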
JD