Re: Matching upper ASCII characters in RE patterns

Jonathan Pool Tue, 30 Nov 2010 11:22:45 -0800

Thanks very much for your further information about this issue.

I'll be happy to file a bug report, but I should also mention that the 
problematic behavior not only exists with "use encoding 'utf8'" and "use utf8", 
but differs between them. Both produce wrong results, but different wrong 
results:


With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

(The 3rd and 7th patterns, out of 7, should fail.)

(If I include both statements, the behavior is the same as if "use encoding 
'utf8'" alone is present. This testing is with "<:encoding(utf8)".)
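
For reference, here is a minimal sketch of the kind of test that produces the lists above. It is not the attached test script: it covers only the patterns that can be written in pure ASCII (pattern 2, the literal no-break space, is left out because it depends on the source encoding), it uses neither pragma, and it decodes the bytes C2 A0 with Encode as a stand-in for reading the file through "<:encoding(utf8)". The names $octets, $line, @patterns and %matched are my own. On a Perl that behaves correctly, I would expect only the [\x7f-\x80] and junk patterns to be reported as NOT matched.

```perl
use strict;
use warnings;
use charnames ':full';
use Encode qw(decode);

# The UTF-8 bytes C2 A0, as they would appear in the text file.
my $octets = "a\xC2\xA0b";

# Decode them into the single character U+00A0, roughly what an
# ':encoding(utf8)' input layer does when a line is read.
my $line = decode('UTF-8', $octets);

# Label / compiled pattern pairs for the ASCII-only patterns.
my @patterns = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/        ],
    [ '[\xa0]',             qr/[\xa0]/             ],
    [ '\xa0',               qr/\xa0/               ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/         ],
    [ 'junk',               qr/junk/               ],
);

# Record and report whether each pattern matches the decoded line.
my %matched;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $matched{$label} = ($line =~ $re) ? 1 : 0;
    printf "The NBS is %s by /%s/\n",
        ($matched{$label} ? 'matched' : 'NOT matched'), $label;
}
```

Adding "use encoding 'utf8'" or "use utf8" at the top of such a script is what changes the results in the way listed above.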

So, I'm confused as to whether this is 1 bug or more than 1, and how best to 
document it (or them). Could you advise me on this?

On 30 Nov 2010, at 10:25, karl williamson wrote:

> Jonathan Pool wrote:
>> Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded 
>> text file (so it appears there as C2A0), and I want to match strings that 
>> contain this character.
>> I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) 
>> with:
>> use encoding 'utf8';
>> use charnames ':full';
>> The script opens the file with:
>> open FH, '<:utf8', 'filename.txt';
> 
> You should always use '<:encoding(utf8)' instead to get utf8 validation.
> But that's not the problem here.
> I tested it on the very latest development code, and it still fails. The 
> problem is a bug or bugs in Perl with parsing files encoded in utf8.  I 
> converted the .pl to latin1 and removed the "use encoding 'utf8'", and it 
> works.
> 
> I believe it is known that there are issues with 'use encoding', but I 
> suggest filing a bug report, by sending email to perl...@perl.org. Attached 
> are two files I created to test.  These should be attached to the bug report 
> so as to not have to be done again.
>> It reads lines in with:
>> while (<FH>) {}
>> Then, in a regular expression in the script, I can match the NO-BREAK SPACE 
>> with any of these patterns:
>> 1. /\N{NO-BREAK SPACE}/
>> 2. / / (where the character between slashes looks like a space but is a 
>> no-break space)
>> 3. /[\x7f-\x80]/
>> Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the 
>> range specified in pattern 3 includes DELETE and an unnamed character but 
>> does not include NO-BREAK SPACE.
>> Moreover, I expect to be able to match the NO-BREAK SPACE with these 
>> patterns, but I cannot:
>> 4. /[\xa0]/
>> 5. /\xa0/
>> In the related documentation, I have not found anything explaining why 
>> pattern 3 works, or anything explaining why patterns 4 and 5 do not work.
>> I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise 
>> Linux 5.
>> I would be delighted to receive explanations or references to documentation 
>> that I have overlooked or misunderstood.
> <nobreak_latin1.pl><nobreak_utf8.pl> 
