Thanks very much for your further information about this issue. I'll be happy to file a bug report, but I should also mention that the problematic behavior not only exists with "use encoding 'utf8'" and "use utf8", but differs between them. Both produce wrong results, but different wrong results:
With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by / / (a no-break space)
The NBS is matched by /[\7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

(The 3rd and 7th patterns, out of 7, should fail.)

(If I include both statements, the behavior is the same as if "use encoding 'utf8'" alone is present. This testing is with "<:encoding(utf8)".)

So, I'm confused as to whether this is 1 bug or more than 1, and how best to document it (or them). Could you advise me on this?

On 30 Nov 2010, at 10:25, karl williamson wrote:

> Jonathan Pool wrote:
>> Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded
>> text file (so it appears there as C2A0), and I want to match strings that
>> contain this character.
>> I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5)
>> with:
>> use encoding 'utf8';
>> use charnames ':full:';
>> The script opens the file with:
>> open FH, '<:utf8', filename.txt;
>
> You should always use '<:encoding(utf8)' instead to get utf8 validation.
> But that's not the problem here.
> I tested it on the very latest development code, and it still fails. The
> problem is a bug or bugs in Perl with parsing files encoded in utf8. I
> converted the .pl to latin1 and removed the "use encoding 'utf8'", and it
> works.
>
> I believe it is known that there are issues with 'use encoding', but I
> suggest filing a bug report, by sending email to perl...@perl.org. Attached
> are two files I created to test. These should be attached to the bug report
> so as to not have to be done again.
>> It reads lines in with:
>> while <FH> {}
>> Then, in a regular expression in the script, I can match the NO-BREAK SPACE
>> with any of these patterns:
>> 1. /\N{NO-BREAK SPACE}/
>> 2. / / (where the character between slashes looks like a space but is a
>> no-break space)
>> 3. /[\x7f-\x80]/
>> Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the
>> range specified in pattern 3 includes DELETE and an unnamed character but
>> does not include NO-BREAK SPACE.
>> Moreover, I expect to be able to match the NO-BREAK SPACE with these
>> patterns, but I cannot:
>> 4. /[\xa0]/
>> 5. /\xa0/
>> In the related documentation, I have not found anything explaining why
>> pattern 3 works, or anything explaining why patterns 4 and 5 do not work.
>> I have replicated these anomalies in Perl 5.8.8. under Red Hat Enterprise
>> Linux 5.
>> I would be delighted to receive explanations or references to documentation
>> that I have overlooked or misunderstood.
> <nobreak_latin1.pl><nobreak_utf8.pl>
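
For anyone who wants to reproduce result lists like the ones above without the attached files, here is a minimal sketch of such a check. It is not the script used for the results in this message. The helper name nbs_report is hypothetical. It decodes the bytes C2 A0 with the Encode module instead of reading a file through "<:encoding(utf8)", on the assumption that the two give the same string. It contains neither "use utf8" nor "use encoding 'utf8'", so it corresponds to the third case. Pattern 2 (a literal no-break space in the source) is left out, because its meaning depends on how the script file itself is encoded.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{...} names on older Perls
use Encode qw(decode);

# Patterns 1 and 3-7 from the message, each paired with a printable label.
my @patterns = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/ ],
    [ '[\xa0]',             qr/[\xa0]/ ],
    [ '\xa0',               qr/\xa0/ ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/ ],
    [ 'junk',               qr/junk/ ],
);

# Hypothetical helper: returns a label => 1/0 list saying which patterns match.
sub nbs_report {
    my ($text) = @_;
    return map { ( $_->[0], ( $text =~ $_->[1] ? 1 : 0 ) ) } @patterns;
}

# Assumption: decoding C2 A0 stands in for reading the character from a
# UTF-8 file through an ':encoding(utf8)' input layer.
my $bytes  = "\xC2\xA0";
my $nbs    = decode( 'UTF-8', $bytes );
my %report = nbs_report($nbs);

for my $p (@patterns) {
    printf "The NBS is %s by /%s/\n",
        ( $report{ $p->[0] } ? 'matched' : 'NOT matched' ), $p->[0];
}
```

If matching behaves as I expect, only /[\x7f-\x80]/ and /junk/ should be reported as NOT matched. Adding "use utf8" or "use encoding 'utf8'" at the top and comparing the output is one way to see whether the differences described above show up on a given Perl version.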