Re: character sets in a regexp

Jim Gibson Fri, 11 Jan 2013 14:32:02 -0800

On Jan 11, 2013, at 2:01 PM, Christer Palm wrote:

> Hi!
> 
> I have a Perl script that parses RSS streams from different news sources, 
> and I am experiencing problems with national characters in a regexp used to 
> match a keyword list against the RSS data. 
> 
> Everything works fine with a simple regexp for plain English, i.e., words 
> containing the letters A-Z, a-z, 0-9.
> 
> if ( $description =~ m/\b$key/i ) {….}
> 
> Keywords or RSS data with national characters don’t work at all. I’m not 
> really surprised; this was expected, as the character sets used in the 
> different RSS streams are outside my control.
> 
> I have the ”use utf8;” pragma enabled, but I’m not really sure it is 
> needed. I can’t see any difference with or without it.


The 'use utf8;' pragma is necessary only if you have UTF-8 characters in the 
Perl source file itself that you want interpreted correctly, e.g., in string 
literals or variable names. It has no effect on how data read from files or 
the network is decoded.
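
To make that concrete, here is a minimal sketch. It assumes the script file 
itself is saved as UTF-8; the word "ångström" is just an illustrative keyword, 
not one taken from your list.

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;                       # literals in this block are decoded as UTF-8
    $chars = length("ångström");    # length counted in characters
}
{
    no utf8;                        # literals in this block stay raw bytes
    $bytes = length("ångström");    # length counted in UTF-8 bytes
}

print "with use utf8: $chars, without: $bytes\n";
```

If the file really is UTF-8, this should report 8 characters with the pragma 
and 10 bytes without it, since "å" and "ö" each take two bytes in UTF-8. That 
mismatch is the kind of thing that makes a keyword literal fail to match 
properly decoded text.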

> 
> If I convert all the national characters in the keyword list to HTML 
> entities (”&aring;” and so on), and a character parser changes every octal, 
> decimal, and hex character reference in the RSS data to the same HTML entity 
> form, everything works fine, but it takes time that I want to avoid.
> 
> Do you have suggestions on this character issue? Is it possible to determine 
> the character set of a text efficiently? Are there other ways to solve the 
> problem?

Have you read the following?

  perldoc perlunitut
  perldoc perlunicode
  perldoc perlunifaq
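
The general approach those documents describe is to decode incoming bytes into 
Perl character strings first, and only then apply the regexp. A rough sketch 
follows, under some assumptions: matches_keyword is a hypothetical helper name, 
the charset of each feed is already known (for example from the feed's XML 
declaration or the HTTP headers), and the sample strings are made up. The 
\Q...\E quoting is my addition, so that punctuation in a keyword is not treated 
as regexp syntax.

```perl
use v5.14;       # enables strict and the unicode_strings feature
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the raw bytes using the feed's declared
# charset, then test whether the keyword occurs at the start of a word.
sub matches_keyword {
    my ($raw_bytes, $charset, $key) = @_;
    my $text = decode($charset, $raw_bytes);   # bytes -> characters
    return $text =~ /\b\Q$key\E/i ? 1 : 0;
}

# The keyword "ångström" written with escapes, so that this example does
# not depend on the encoding of the source file.
my $key = "\x{e5}ngstr\x{f6}m";

# The same headline as it would arrive from two differently encoded feeds.
my $utf8_bytes   = "Ny \xC3\x85ngstr\xC3\xB6m-rapport";   # UTF-8
my $latin1_bytes = "Ny \xC5ngstr\xF6m-rapport";           # ISO-8859-1

print matches_keyword($utf8_bytes,   'UTF-8',      $key), "\n";
print matches_keyword($latin1_bytes, 'ISO-8859-1', $key), "\n";
```

Once both sides are character strings, \b and /i behave the same way for 
national characters as they do for A-Z, so the conversion to HTML entities 
should not be needed.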


--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
