Hello,

I'm trying to process some UTF-8-encoded files (Wikipedia extracts) through Text::MediawikiFormat.

It works rather well as far as the HTML conversion goes, except that the character encoding gets lost along the way: what used to be properly UTF-8-encoded Russian (вычислительная машина) comes out mangled as √ê¬≤√ë&...
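If it helps, my guess is that the bytes are being encoded twice somewhere, but I may be wrong. The snippet below is only my attempt at reproducing the symptom, not part of my actual script:

use utf8;                              # the Cyrillic literal below is UTF-8 in the source
use Encode qw(encode decode);

my $str   = "вычислительная машина";   # a proper Perl character string
my $once  = encode( 'UTF-8', $str );   # the correct UTF-8 byte string
# If those bytes get treated as Latin-1 characters and encoded a second time,
# every byte doubles up, which I think is how my Mac terminal ends up showing
# the √ê¬≤√ë... style garbage:
my $twice = encode( 'UTF-8', decode( 'ISO-8859-1', $once ) );
print $twice, "\n";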

Here is what I presently do:

use strict;
use warnings;

use File::Find;
use File::Slurp;
use Text::MediawikiFormat as => 'Format';

find( \&process, "/Volumes/Staten/wiki/content/z" );

sub process
{
    if ( $File::Find::name =~ /text\.wiki$/ )
    {
        print "$File::Find::name\n";

        # slurp the wikitext as UTF-8, convert it, write the HTML back as UTF-8
        my $data = read_file( $File::Find::name, binmode => ':utf8' );
        my $text = Format( $data );

        write_file( 'text.html', { binmode => ':utf8' }, $text );
    }
}

I tried most of Ivan Kurmanov's recommendations from the article below, but to no avail:

"Unicode-processing issues in Perl and how to cope with it"
http://ahinea.com/en/tech/perl-unicode-struggle.html
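Concretely, the sort of thing I tried, following that article, was to take the PerlIO layers out of the picture and do the decoding and encoding by hand around the formatting step. This is only a sketch from memory, using the same read_file/write_file/Format as in the script above:

use Encode qw(decode encode);

# read raw octets, decode to characters, format, encode back to octets, write
my $octets = read_file( $File::Find::name, binmode => ':raw' );
my $chars  = decode( 'UTF-8', $octets );
my $html   = Format( $chars );
write_file( 'text.html', { binmode => ':raw' }, encode( 'UTF-8', $html ) );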

Out of desperation, I patched File::Slurp to add explicit binmode support as described below, but that didn't help either:

"Bug#429933: libfile-slurp-perl: Please support UTF8 binary modes"
http://www.mail-archive.com/[EMAIL PROTECTED]/msg360928.html

What am I doing wrong? Is there a simple example of how to read the full content of a UTF-8-encoded file, process it, and write the result back to the file system without losing the character encoding?
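Put differently, I am after something along these lines (plain open with explicit encoding layers; the file names are just placeholders), only with the Cyrillic surviving the round trip:

use strict;
use warnings;
use Text::MediawikiFormat as => 'Format';

open my $in, '<:encoding(UTF-8)', 'text.wiki' or die "text.wiki: $!";
my $data = do { local $/; <$in> };      # slurp the whole file as characters
close $in;

my $html = Format( $data );

open my $out, '>:encoding(UTF-8)', 'text.html' or die "text.html: $!";
print {$out} $html;
close $out;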

Any pointers much appreciated.

Thanks in advance.

Kind regards,

PA.



