On Jan 11, 2013, at 2:01 PM, Christer Palm wrote:

> Hi!
>
> I have a Perl script that parses RSS streams from different news sources, and
> I am experiencing problems with national characters in a regexp used for
> matching a keyword list against the RSS data.
>
> Everything works fine with a simple regexp for plain English, i.e. words
> containing the letters A-Z, a-z, 0-9.
>
> if ( $description =~ m/\b$key/i ) {….}
>
> Keywords or RSS data with national characters don't work at all. I'm not
> really surprised; this was expected, as the character sets used in the
> different RSS streams are outside my control.
>
> I have "use utf8;" activated, but I'm not really sure whether it is
> needed. I can't see any difference whether it is used or not.

The 'use utf8;' pragma is only necessary if you have UTF-8 characters in your
Perl source file that you want interpreted correctly, e.g., in string literals
or variable names. It does not affect data read at run time, such as the RSS
content, which would explain why you see no difference.

> If I convert all the national characters used in the keyword list to HTML
> type "&aring;" and so on, and change every occurrence of octal and Unicode
> characters (i.e. decimal and hex) in the RSS data to HTML type in a
> character parser, everything works fine, but it takes time that I want to
> avoid.
>
> Do you have suggestions on this character issue? Is it possible to determine
> the character set of a text efficiently? Are there other ways to solve the
> problem?

Have you read the following?

  perldoc perlunitut
  perldoc perlunicode
  perldoc perlunifaq
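
As a rough sketch of the approach those documents describe (decode the
incoming bytes into characters first, then match), something like the
following might be a starting point. It assumes the feed is UTF-8; for a real
feed the encoding name would have to come from its XML declaration or HTTP
headers. The keyword list, the sample text and the matching_keywords helper
are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                 # the keyword literals below are UTF-8 in this source file
use Encode qw(decode);

# Hypothetical keyword list containing national characters.
my @keywords = ('år', 'över', 'news');

# Return the keywords that occur at the start of a word in $description.
# $description must already be a decoded character string, not raw bytes.
sub matching_keywords {
    my ($description) = @_;
    # \Q...\E keeps any regexp metacharacters in a keyword literal.
    return grep { $description =~ m/\b\Q$_\E/i } @keywords;
}

# Simulated raw bytes from a feed: "Nytt år, ÖVER hela landet" encoded as UTF-8.
my $raw_bytes   = "Nytt \xC3\xA5r, \xC3\x96VER hela landet";

# Assumption: the feed is UTF-8. Use the feed's declared encoding here instead.
my $description = decode('UTF-8', $raw_bytes);

my @hits = matching_keywords($description);

# Encode on output so the terminal receives UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print join(', ', @hits), "\n";
```

Once the text is decoded, \b and /i follow Unicode rules, so "ÖVER" matches the
keyword "över" without converting anything to HTML entities.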