Peter Volkov wrote on 2008-07-11 at 10:10 (+0400):
> The problem is that in Linux (Gentoo and Debian I've tried) /\w/ does
> not match Russian letter while I use locale and LC_COLLATE is set to
> ru_RU.UTF-8.
\w should match Cyrillic letters even without "use locale". You might be
running into an annoying bug that makes \w lose its Unicode support
depending on the *internal* state of a value. To work around this bug,
read the documentation for Unicode::Semantics on CPAN, then use either
that module or utf8::upgrade.
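A minimal sketch of the utf8::upgrade workaround, not taken from the
original report. The test value "\xE9" (the letter e with acute accent)
is an assumption chosen for illustration, because characters in the
128-255 range are the ones Perl may store in its non-UTF-8 internal
form. No locale and no unicode_strings feature are in effect here.

```perl
use strict;
use warnings;

# Hypothetical example value: U+00E9 as a single character that Perl
# keeps in its non-UTF-8 (byte) internal representation.
my $char = "\xE9";

# Match attempt while the value is in its original internal state.
my $before = $char =~ /\w/ ? "match" : "no match";

# utf8::upgrade changes only the internal representation; the string
# still holds the same single character.
utf8::upgrade($char);

# Same pattern, same character, different internal state.
my $after = $char =~ /\w/ ? "match" : "no match";

print "before upgrade: $before\n";
print "after upgrade:  $after\n";
```

If the two lines differ, the match result depended on the internal
state of the value rather than on its content, which is the bug
described above.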
> Linux $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if
> /\w/; }'
> string with spaces (not only with [:alnum:])
> English;
> hello_привет
Despite the above, there's a slightly more important issue here: you're
opening a text file without specifying its character encoding.
Likewise, you need to specify the encoding for output.
Assuming utf8 for both:
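perl -le'
    # Encode everything printed to STDOUT as UTF-8.
    binmode STDOUT, ":encoding(utf8)";
    # Decode the input file as UTF-8 while reading it.
    open my $in, "< :encoding(utf8)", "test-file"
        or die "Cannot open test-file: $!";
    while (<$in>) {
        print "match: [$1]" if /(\w+)/;
    }
'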
Which on my system prints:
match: [слово]
match: [строка]
match: [string]
match: [English]
match: [hello_привет]
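To illustrate why the encoding layer matters, here is a small sketch
that compares undecoded bytes with decoded text. It is an illustration
under assumptions, not part of the original test: the byte string is
the UTF-8 encoding of "привет" written out by hand, and Encode::decode
stands in for what the :encoding layer does while reading.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for "привет", as they would arrive from a file that
# was opened without an encoding layer.
my $bytes = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Decoding turns the 12 bytes into 6 characters.
my $text = decode("UTF-8", $bytes);

# Try \w on the raw bytes and on the decoded characters.
my $bytes_match = $bytes =~ /\w/     ? 1 : 0;
my $text_match  = $text  =~ /^\w+$/ ? 1 : 0;

printf "bytes: length %d, \\w match: %d\n", length($bytes), $bytes_match;
printf "text:  length %d, \\w match: %d\n", length($text),  $text_match;
```

Without "use locale", the raw bytes are not word characters to Perl,
while the decoded string consists entirely of them.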
I'm not sufficiently familiar with "use encoding" to say anything about
it, but you shouldn't need it just for this.
> Do I understand correctly that we should always supply encoding of
> streams?
Yes.
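One way to avoid repeating the layer on every open is the "open"
pragma, sketched below. This is a suggestion beyond the original
example, so treat it as an assumption about what suits your script.
The temporary file is a hypothetical stand-in for test-file.

```perl
use strict;
use warnings;
# Default layer for open() in this scope, plus STDIN/STDOUT/STDERR.
use open qw(:std :encoding(UTF-8));
use File::Temp qw(tempfile);

# Write the raw UTF-8 bytes for "привет" to a temporary file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;    # raw bytes, no encoding layer on this handle
print {$tmp} "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n";
close $tmp or die "Cannot close $path: $!";

# No layer is given here; the pragma supplies :encoding(UTF-8).
open my $in, "<", $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# \w+ now sees decoded characters, not bytes.
my ($word) = $line =~ /(\w+)/;
print "match: [$word]\n";
```

The pragma is lexically scoped, so it only affects opens in the file or
block where it appears. An explicit layer in the open mode is still
honoured.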
> If yes, why in FreeBSD this works without supplying any encoding and is
> it possible (good idea) to do the same in Linux?
I have no idea.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <[EMAIL PROTECTED]> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <[EMAIL PROTECTED]>
1;