Hello,
I'm trying to process some UTF-8 encoded files (Wikipedia extracts)
through Text::MediawikiFormat.
It works rather well as far as the HTML conversion goes, except that
the character encoding gets lost along the way: what used to be
properly UTF-8 encoded Russian (вычислительная машина) comes out
mangled as вÑ&...
Here is what I presently do:
use strict;
use warnings;
use File::Find;
use File::Slurp;
use Text::MediawikiFormat as => 'Format';   # import the formatter as Format()

find( \&process, '/Volumes/Staten/wiki/content/z' );

sub process
{
    return unless $File::Find::name =~ /text\.wiki$/;
    print "$File::Find::name\n";

    # Read the wikitext, convert it to HTML, write the result back.
    my $data = read_file( $File::Find::name, binmode => ':utf8' );
    my $text = Format( $data );
    write_file( 'text.html', { binmode => ':utf8' }, $text );
}
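For what it's worth, I would have expected even a minimal round-trip
along these lines to preserve the encoding (just a sketch using core
open with explicit :encoding layers, no File::Slurp involved; the file
names are placeholders):

use strict;
use warnings;

# Read the whole file through an explicit UTF-8 decoding layer.
open my $in, '<:encoding(UTF-8)', 'text.wiki' or die "open text.wiki: $!";
my $data = do { local $/; <$in> };   # slurp the whole file
close $in;

# ... process $data here: it now holds decoded characters, not bytes ...

# Write it back through a matching UTF-8 encoding layer.
open my $out, '>:encoding(UTF-8)', 'text.html' or die "open text.html: $!";
print {$out} $data;
close $out;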
I tried most of Ivan Kurmanov's recommendations in the article below,
but to no avail:
"Unicode-processing issues in Perl and how to cope with it"
http://ahinea.com/en/tech/perl-unicode-struggle.html
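For instance, I tried decoding on input and encoding on output
explicitly with Encode, roughly like this (a sketch, not my exact
code):

use Encode qw( decode encode );

# Read raw octets, decode them to characters, process, re-encode.
my $octets = read_file( $File::Find::name, binmode => ':raw' );
my $data   = decode( 'UTF-8', $octets );              # bytes -> characters
my $text   = Format( $data );
write_file( 'text.html', encode( 'UTF-8', $text ) ); # characters -> bytes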
Out of desperation, I patched File::Slurp to add explicit binmode
support as described below, but that didn't help either:
"Bug#429933: libfile-slurp-perl: Please support UTF8 binary modes"
http://www.mail-archive.com/[EMAIL PROTECTED]/msg360928.html
What am I doing wrong? Is there a simple example of how to read the
full content of a UTF-8 encoded file, process it, and write the result
back to the file system without losing the character encoding?
Any pointers much appreciated.
Thanks in advance.
Kind regards,
PA.