Parsing UTF8 files with wide characters

Robin Wed, 15 Jun 2005 11:48:48 -0700

I thought I'd understood how to use Unicode support in Perl, but evidently not. In the script below, I'm stumped as to:


1) why the regex won't match '月'.

2) why the substitution is carried out, but the result isn't in UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require Encode; and #Encode::decode_utf8($_); to test this)




TIA


Robin



 #!/usr/bin/perl -w

use strict;
use diagnostics -verbose;
#require Encode;


binmode (DATA,":utf8");


binmode (STDOUT,":utf8");


for (<DATA>) {
        if (/(<[EMAIL PROTECTED]>)/gs) {
                print "match: ", $1, "\n";
                #Encode::decode_utf8($_);
                s/$1/日本の/gs;
        } elsif (/(月)/gs) {
                print "match: ", $1, "\n";
                s/$1/12月/gs;
        }

        print;
}
        



__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
        <TITLE> A Web Page</TITLE>  
</HEAD>
<BODY>
<BLOCKQUOTE>
<H3>日本語のnews<FONT COLOR=#FF3300>1月</FONT></H3>
... and this is a web page.
<P>
<IMG ALT="A Filler" WIDTH="450" HEIGHT="296">
<P>

hidden marker here -----><FONT COLOR=#FF3300><[EMAIL PROTECTED]></FONT><------<BR>

</BLOCKQUOTE>
</BODY>
</HTML>
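
A possible explanation for both symptoms, offered as a sketch rather than a confirmed diagnosis: the script has no "use utf8;", so Perl may be reading the literals '月' and '日本の' in the source as raw bytes, while the lines coming from DATA through the :utf8 layer are decoded characters. The standalone example below does not use the script itself. It assumes the source file is saved as UTF-8 and spells out the three bytes of '月' (E6 9C 88) by hand; all variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes a UTF-8 source file holds for the literal '月' (U+6708).
# Without "use utf8", Perl sees the literal as these three separate bytes.
my $literal_bytes = "\xE6\x9C\x88";

# What a handle with the :utf8 layer hands back: a single character.
my $decoded_char = decode('UTF-8', $literal_bytes);

# A line as it would arrive from binmode(DATA, ":utf8"): "1" followed by U+6708.
my $raw_line = "1\xE6\x9C\x88";
my $line     = decode('UTF-8', $raw_line);

# Question 1: a three-byte pattern cannot match a one-character string.
my $bytes_match = ($line =~ /\Q$literal_bytes\E/) ? 1 : 0;
my $char_match  = ($line =~ /\Q$decoded_char\E/)  ? 1 : 0;

# Question 2: an undecoded replacement is treated as three Latin-1
# characters, so a :utf8 output layer encodes each of them a second time.
my $double_encoded = encode('UTF-8', $literal_bytes);

printf "length of byte literal:     %d\n", length $literal_bytes;
printf "length of decoded char:     %d\n", length $decoded_char;
printf "byte pattern matches line:  %d\n", $bytes_match;
printf "char pattern matches line:  %d\n", $char_match;
printf "bytes after double encode:  %d\n", length $double_encoded;
```

If this is what is happening, adding "use utf8;" near the top of the script should turn the literals into character strings, so that the pattern and the replacement agree with the decoded DATA lines. Note also that Encode::decode_utf8($_) returns the decoded string; called on its own line without assigning the result, it would not be expected to change $_.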
