Re: Parsing UTF8 files with wide characters

Andrew Mace Wed, 15 Jun 2005 11:54:55 -0700

Try "use utf8" - it lets Perl know that your script contains utf8 chars.


More info: http://perlpod.com/5.9.1/lib/utf8.html
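A minimal sketch of why this helps, assuming the data arrives as UTF-8 bytes and is decoded on input (which is what the :utf8 layer on DATA does). Without "use utf8", the 月 in the pattern is three separate bytes, while the decoded input holds a single character, so the two never match. The helper name replace_month and the sample string are made up for illustration; they are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # literals in this source file (e.g. 月) are characters, not bytes
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns a decoded
# character string with every 月 replaced by 12月.
sub replace_month {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ s/月/12月/g;                 # matches because both sides are characters
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');      # encode characters back to UTF-8 on output

my $raw = "news \xE6\x9C\x88\n";          # \xE6\x9C\x88 is the UTF-8 encoding of 月 (U+6708)
print replace_month($raw);
```

Removing the "use utf8" line from this sketch should make the substitution silently do nothing, which is the behaviour described in the question.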


Andrew



On Jun 15, 2005, at 2:48 PM, Robin wrote:

I thought I'd understood how to use unicode support in perl, but evidently not. In the script below, I'm stumped as to:
1) why the regex won't match '月'.
2) why the substitution is carried out, but the result isn't in UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require Encode; ........... #Encode::decode_utf8($_); to test this )
TIA


Robin



#!/usr/bin/perl -w

use strict;
use diagnostics -verbose;
#require Encode;


binmode (DATA,":utf8");


binmode (STDOUT,":utf8");


for (<DATA>){

    if (/(<[EMAIL PROTECTED]>)/gs){
    print "match: ",$1,"\n";
    #Encode::decode_utf8($_);
    s/$1/日本の/gs;

    }elsif(/(月)/gs){
    print "match: ",$1,"\n";
    s/$1/12月/gs;


    }

    print;

}




__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
    <TITLE> A Web Page</TITLE>
</HEAD>
<BODY>
<BLOCKQUOTE>
<H3>日本語のnews<FONT COLOR=#FF3300>1月</FONT></H3>
... and this is a web page.
<P>
<IMG ALT="A Filler" WIDTH="450" HEIGHT="296">
<P>
hidden marker here -----><FONT COLOR=#FF3300><[EMAIL PROTECTED]></FONT><------<BR>
</BLOCKQUOTE>
</BODY>
</HTML>
