Hello,

I'm trying to process some UTF-8-encoded files (Wikipedia extracts) through Text::MediawikiFormat.

It works rather well as far as the HTML conversion goes, except that the character encoding gets lost along the way: what used to be properly UTF-8-encoded Russian (вычислительная машина) comes out mangled as √ê¬≤√ë&...
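If it helps, my guess is that the bytes are being encoded twice somewhere, but I may be wrong. The snippet below is only my attempt at reproducing the symptom, not part of my actual script:

use utf8;                              # the Cyrillic literal below is UTF-8 in the source
use Encode qw(encode decode);

my $str   = "вычислительная машина";   # a proper Perl character string
my $once  = encode( 'UTF-8', $str );   # the correct UTF-8 byte string
# If those bytes get treated as Latin-1 characters and encoded a second time,
# every byte doubles up, which I think is how my Mac terminal ends up showing
# the √ê¬≤√ë... style garbage:
my $twice = encode( 'UTF-8', decode( 'ISO-8859-1', $once ) );
print $twice, "\n";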

Here is what I presently do:

use strict;
use warnings;

use File::Find;
use File::Slurp;
use Text::MediawikiFormat as => 'Format';

find( \&process, "/Volumes/Staten/wiki/content/z" );

sub process
{
    if ( $File::Find::name =~ /text\.wiki$/ )
    {
        print "$File::Find::name\n";

        # slurp the wikitext as UTF-8, convert it, write the HTML back as UTF-8
        my $data = read_file( $File::Find::name, binmode => ':utf8' );
        my $text = Format( $data );

        write_file( 'text.html', { binmode => ':utf8' }, $text );
    }
}

I tried most of Ivan Kurmanov's recommendations from the article below, but to no avail:

"Unicode-processing issues in Perl and how to cope with it"
http://ahinea.com/en/tech/perl-unicode-struggle.html
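Concretely, the sort of thing I tried, following that article, was to take the PerlIO layers out of the picture and do the decoding and encoding by hand around the formatting step. This is only a sketch from memory, using the same read_file/write_file/Format as in the script above:

use Encode qw(decode encode);

# read raw octets, decode to characters, format, encode back to octets, write
my $octets = read_file( $File::Find::name, binmode => ':raw' );
my $chars  = decode( 'UTF-8', $octets );
my $html   = Format( $chars );
write_file( 'text.html', { binmode => ':raw' }, encode( 'UTF-8', $html ) );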

Out of desperation, I patched File::Slurp to add explicit binmode support as described below, but that didn't help either:

"Bug#429933: libfile-slurp-perl: Please support UTF8 binary modes"
http://www.mail-archive.com/[EMAIL PROTECTED]/msg360928.html

What am I doing wrong? Is there a simple example of how to read the full content of a UTF-8-encoded file, process it, and write the result back to the file system without losing the character encoding?
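Put differently, I am after something along these lines (plain open with explicit encoding layers; the file names are just placeholders), only with the Cyrillic surviving the round trip:

use strict;
use warnings;
use Text::MediawikiFormat as => 'Format';

open my $in, '<:encoding(UTF-8)', 'text.wiki' or die "text.wiki: $!";
my $data = do { local $/; <$in> };      # slurp the whole file as characters
close $in;

my $html = Format( $data );

open my $out, '>:encoding(UTF-8)', 'text.html' or die "text.html: $!";
print {$out} $html;
close $out;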

Any pointers much appreciated.

Thanks in advance.

Kind regards,

PA.



