/\w/ match with 'use locale' misses letters in utf8 locale

Juerd Waalboer Fri, 11 Jul 2008 00:00:31 -0700

Peter Volkov skribis 2008-07-11 10:10 (+0400):
> The problem is that in Linux (Gentoo and Debian I've tried) /\w/ does
> not match Russian letter while I use locale and LC_COLLATE is set to
> ru_RU.UTF-8.


\w should match Cyrillic letters even without "use locale". You might be
running into an annoying bug that makes \w lose its Unicode support
depending on the *internal* representation of a string. To work around
it, read the Unicode::Semantics documentation on CPAN, then either use
that module or call utf8::upgrade on the string.
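
A minimal sketch of the utf8::upgrade workaround, to make "internal
state" concrete. The variable names and the test character are my own
choices for illustration. I use "\xE9" (e with acute) because the bug
only bites characters in the 128..255 range; the sketch also assumes
that neither "use locale" nor the unicode_strings feature is in effect.

```perl
use strict;
use warnings;

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE. Built this way, the string
# is stored internally as one byte, without the UTF8 flag.
my $word = "\xE9";

# Match against the byte-stored string.
my $before = $word =~ /\w/ ? 1 : 0;

# Switch the internal representation to UTF-8. The characters in the
# string do not change, only how perl stores them.
utf8::upgrade($word);

# Same string, same regex, different internal state.
my $after = $word =~ /\w/ ? 1 : 0;

printf "before upgrade: %d, after upgrade: %d\n", $before, $after;
```

Under those assumptions I would expect "before" to be 0 and "after" to
be 1, even though the string holds the same single character both times.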

> Linux $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
> string with spaces (not only with [:alnum:])
> English;
> hello_привет

That said, there is a more important issue here: you're opening a text
file without specifying its character encoding. Likewise, you need to
specify the encoding for output.

Assuming utf8 for both:

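    perl -le'
        # Encode output as UTF-8.
        binmode STDOUT, ":encoding(utf8)";
        # Decode input from UTF-8 while reading; stop if open fails.
        open my $in, "< :encoding(utf8)", "test-file"
            or die "Cannot open test-file: $!";
        while (<$in>) {
            print "match: [$1]" if /(\w+)/;
        }
    '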

Which on my system prints:

    match: [слово]
    match: [строка]
    match: [string]
    match: [English]
    match: [hello_привет]

I'm not sufficiently familiar with "use encoding" to say anything about
it, but you shouldn't need it just for this.

> Do I understand correctly that we should always supply encoding of
> streams?

Yes.
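
To illustrate why, here is a small sketch that compares undecoded bytes
with a decoded string. It uses Encode::decode instead of an I/O layer so
that it needs no input file. The byte string is the UTF-8 encoding of
the word "привет", written out by hand, and the variable names are only
for illustration. As before, it assumes no "use locale" and no
unicode_strings feature.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of the six Cyrillic letters in "привет", as raw
# bytes. This is what a read returns when no encoding layer is set.
my $bytes = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Decoding turns the twelve bytes into six characters.
my $chars = decode('UTF-8', $bytes);

# To perl, the undecoded string is twelve unrelated high bytes.
my $bytes_match = $bytes =~ /\w/ ? 1 : 0;

# The decoded string consists of letters only.
my $chars_match = $chars =~ /^\w+$/ ? 1 : 0;

printf "bytes: length %d, \\w match %d\n", length($bytes), $bytes_match;
printf "chars: length %d, \\w match %d\n", length($chars), $chars_match;
```

Without decoding, perl has no way of knowing that those bytes are meant
to be letters, so \w has nothing to match.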

> If yes, why in FreeBSD this works without supplying any encoding and is
> it possible (good idea) to do the same in Linux?

I have no idea.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <[EMAIL PROTECTED]>