Jonathan Pool wrote:
Thanks very much for your further information about this issue.
I'll be happy to file a bug report, but I should also mention that the problematic behavior not
only exists with "use encoding 'utf8'" and "use utf8", but differs between
them. Both produce wrong results, but different wrong results:
Just one bug report will be fine. I don't have a Perl 5.10 lying
around to test on, but I can say that the files I sent you did what I
said on 5.13.7. I think that the one that was supposedly in latin1
could have gotten converted to utf8 in the email process. There have
been many significant bug fixes in Perl since 5.10.0.
With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/
(The 3rd and 7th patterns, out of 7, should fail.)
(If I include both statements, the behavior is the same as if "use encoding 'utf8'" alone is
present. This testing is with "<:encoding(utf8)".)
So, I'm confused as to whether this is 1 bug or more than 1, and how best to
document it (or them). Could you advise me on this?
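
For readers who want to reproduce output in the format listed above, here is a minimal sketch of a test harness. It is not the attached nobreak_latin1.pl or nobreak_utf8.pl, whose contents are not shown in this thread; the variable names are hypothetical. It deliberately uses neither "use encoding 'utf8'" nor "use utf8" and keeps the source pure ASCII, so the literal no-break-space pattern is left out. One of the pragmas can be added at the top to compare behavior on a given Perl version.

```perl
use strict;
use warnings;
use charnames ':full';

# A one-character test string holding NO-BREAK SPACE (U+00A0).
my $nbs = "\N{NO-BREAK SPACE}";

# Pattern labels paired with compiled regexes (the literal-NBSP pattern is
# omitted because it depends on how the script file itself is encoded).
my @patterns = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/        ],
    [ '[\xa0]',             qr/[\xa0]/             ],
    [ '\xa0',               qr/\xa0/               ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/         ],
    [ 'junk',               qr/junk/               ],
);

# Record and report whether each pattern matches the test string.
my %result;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $result{$label} = $nbs =~ $re ? 1 : 0;
    printf "The NBS is %smatched by /%s/\n",
        $result{$label} ? '' : 'NOT ', $label;
}
```

Because the test string is built in the script rather than read from a file, this isolates the regex behavior from any I/O layer effects.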
On 30 Nov 2010, at 10:25, karl williamson wrote:
Jonathan Pool wrote:
Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text
file (so it appears there as C2A0), and I want to match strings that contain
this character.
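
As a quick check of the byte sequence mentioned above, the following sketch uses the core Encode module to show how U+00A0 is represented in UTF-8 and that decoding the bytes yields a single character again. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 as a one-character Perl string.
my $nbsp = "\x{00A0}";

# Encode the character to UTF-8 bytes and show them in hex.
my $bytes = encode('UTF-8', $nbsp);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "UTF-8 bytes: $hex\n";    # prints "UTF-8 bytes: C2 A0"

# Decoding the bytes gives back one character with code point 0xA0.
my $chars = decode('UTF-8', $bytes);
printf "decoded length: %d, code point: U+%04X\n",
    length($chars), ord($chars);
```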
I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5)
with:
use encoding 'utf8';
use charnames ':full';
The script opens the file with:
open FH, '<:utf8', 'filename.txt';
You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
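
A minimal sketch of that suggestion, assuming a throwaway temporary file in place of the real input and a lexical filehandle instead of the bareword FH (both are choices made for this illustration, not taken from the original script):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 file containing "a", NO-BREAK SPACE (bytes C2 A0), "b".
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "a\xC2\xA0b\n";
close $out or die "close failed: $!";

# Read it back through the validating decoding layer.
open my $fh, '<:encoding(utf8)', $path or die "Cannot open $path: $!";
my $found = 0;
while (my $line = <$fh>) {
    chomp $line;
    # After decoding, the NBSP is a single character with code point 0xA0.
    $found++ if $line =~ /\x{00A0}/;
    printf "length %d: %s\n", length($line),
        join ' ', map { sprintf 'U+%04X', ord } split //, $line;
}
close $fh;
```

This script does not load "use encoding 'utf8'", so it shows only the decoding layer at work, not the pragma interaction discussed in this thread.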
I tested it on the very latest development code, and it still fails. The problem is a bug
or bugs in Perl with parsing files encoded in utf8. I converted the .pl to latin1 and
removed the "use encoding 'utf8'", and it works.
I believe it is known that there are issues with 'use encoding', but I suggest
filing a bug report, by sending email to perl...@perl.org. Attached are two
files I created to test. These should be attached to the bug report so that
they do not have to be recreated.
It reads lines in with:
while (<FH>) {}
Then, in a regular expression in the script, I can match the NO-BREAK SPACE
with any of these patterns:
1. /\N{NO-BREAK SPACE}/
2. / / (where the character between slashes looks like a space but is a
no-break space)
3. /[\x7f-\x80]/
Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the
range specified in pattern 3 includes DELETE and an unnamed character but does
not include NO-BREAK SPACE.
Moreover, I expect to be able to match the NO-BREAK SPACE with these patterns,
but I cannot:
4. /[\xa0]/
5. /\xa0/
In the related documentation, I have not found anything explaining why pattern
3 works, or anything explaining why patterns 4 and 5 do not work.
I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise Linux
5.
I would be delighted to receive explanations or references to documentation
that I have overlooked or misunderstood.
<nobreak_latin1.pl><nobreak_utf8.pl>