Summary (Re: RH/debian regular expression weirdness)

Ethan Alpert Sun, 28 Sep 2003 07:50:26 -0700

The problem was Perl 5.8.0's implicit use of UTF-8 on filehandles under UTF-8 locales.

My solution was to install 5.8.1.


The problem is described in:

http://search.cpan.org/src/JHI/perl-5.8.1/pod/perldelta.pod

Here's the excerpt:

UTF-8 no longer default under UTF-8 locales

In Perl 5.8.0 many Unicode features were introduced.   One of them
was found to be of more nuisance than benefit: the automagic
(and silent) "UTF-8-ification" of filehandles, including the
standard filehandles, if the user's locale settings indicated
use of UTF-8.

For example, if you had C<en_US.UTF-8> as your locale, your STDIN and
STDOUT were automatically "UTF-8", in other words an implicit
binmode(..., ":utf8") was made.  This meant that trying to print, say,
chr(0xff), ended up printing the bytes 0xc3 0xbf.  Hardly what
you had in mind unless you were aware of this feature of Perl 5.8.0.
The problem is that the vast majority of people weren't: for example
in RedHat releases 8 and 9 the B<default> locale setting is UTF-8, so
all RedHat users got UTF-8 filehandles, whether they wanted it or not.
The pain was intensified by the Unicode implementation of Perl 5.8.0
(still) having nasty bugs, especially related to the use of s/// and
tr///.  (Bugs that have been fixed in 5.8.1)

Therefore a decision was made to backtrack the feature and change it
from implicit silent default to explicit conscious option.  The new
Perl command line option C<-C> and its counterpart environment
variable PERL_UNICODE can now be used to control how Perl and Unicode
interact at interfaces like I/O and for example the command line
arguments.  See L<perlrun/-C> and L<perlrun/PERL_UNICODE> for more
information.

You can also now use safe signals with POSIX::SigAction.
See L<POSIX/POSIX::SigAction>.
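
To make the chr(0xff) example from the excerpt concrete, here is a minimal sketch. It is not taken from perldelta: the helper name bytes_written is made up for illustration, and an in-memory filehandle stands in for STDOUT so the written bytes can be inspected. It assumes a perl where PERL_UNICODE and "use open" are not in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write chr(0xff) through a filehandle opened with the
# given layer string ('' or ':utf8') and return the bytes written, as hex.
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    # An in-memory filehandle stands in for STDOUT here.
    open(my $fh, ">$layer", \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} chr(0xff);
    close($fh) or die "close failed: $!";
    return unpack("H*", $buffer);
}

print "no layer: ", bytes_written(''),      "\n";   # one byte, 0xff
print ":utf8:    ", bytes_written(':utf8'), "\n";   # two bytes, 0xc3 0xbf
```

Under 5.8.0 with a UTF-8 locale, STDIN and STDOUT behaved like the second case without being asked to.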


>
> Hello! My first post. Basically I have this script that runs differently on
> my RH box than it does on my debian box. I'm hoping someone might be able
> to point me in the right direction.  Both machines are running perl v5.8.0
> yet something is very different and I suspect something's wrong with my red
> hat install.
>
> Here's the example:
>
> #!/usr/bin/perl -w
> use strict;
>
> while (my $line = <STDIN>) {
> $line =~ s/%([a-fA-F0-9]{2})/chr(hex($1))/ge;
> $line = unpack ("H*", $line);
>
> print $line . "\n";
>
> }
>
> test.dat:
> %948%F9%C5%F6%A7x%C4%95%A6%D2a%97%AB%1C%9F%EA%C5%0C%E6
> y%28%BA%E6%00%B82tI%C8%80%1E%90B%19%27G%01%84%BF
>
> Debian output *correct*:
> 7928bae600b8327449c8801e90421927470184bf0a
> 9438f9c5f6a778c495a6d26197ab1c9feac50ce60a
>
> RH output *wrong*:
> c29438c3b9c385c3b6c2a778c384c295c2a6c39261c297c2ab1cc29fc3aac3850cc3a60a
> 7928c2bac3a600c2b8327449c388c2801ec2904219274701c284c2bf0a
>
>
> As you can see, something is way borked.  I have a feeling it's something in
> the installation that, being a newbie, I don't know about.
>
> Here is another example of a problem porting the same script between
> machines. I had to convert the following line from:
>
> \"([^"\s]+)\s
>
> to:
>
> \"([^ "]+)\s
>
> So something is very different in the regular expression matching of the
> machines. Can anyone give me a hint why the same syntax on two different
> machines with the same version of perl would give me different semantic
> results? I'm dying to know.
>
> Thanks,
>
> -ethan
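
The two symptoms in the quoted message can be reproduced with a short sketch. Part 1 takes the first bytes of the correct output, treats each byte as a Latin-1 character, UTF-8 encodes it, and compares the result with the start of the RH output. Part 2 shows one plausible mechanism for the regex difference, which the thread does not confirm: \s does not match the byte 0xA0 in a plain byte string, but it does match once the string carries Perl's internal UTF-8 flag. The helper name utf8_hex_of_bytes is made up for illustration, and the sketch assumes no "use locale" and no unicode_strings feature.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Part 1: hypothetical helper that mimics what a :utf8 output layer does to
# a string of bytes. Input and output are hex strings.
sub utf8_hex_of_bytes {
    my ($hex) = @_;
    my $string = pack("H*", $hex);   # one character per byte
    utf8::encode($string);           # characters >= 0x80 become two bytes
    return unpack("H*", $string);
}

# Leading bytes of the "correct" and "wrong" outputs quoted above.
my $expected = "9438f9c5f6a778";
my $observed = "c29438c3b9c385c3b6c2a778";
print "re-encoded expected output ",
      (utf8_hex_of_bytes($expected) eq $observed ? "matches" : "differs from"),
      " the RH output\n";

# Part 2: the same character, 0xA0, tested against \s before and after the
# string is upgraded to Perl's internal UTF-8 representation.
my $nbsp = "\xa0";
my $as_bytes = ($nbsp =~ /\s/) ? 1 : 0;
utf8::upgrade($nbsp);
my $as_chars = ($nbsp =~ /\s/) ? 1 : 0;
print "\\s matches 0xA0 as bytes: $as_bytes, as characters: $as_chars\n";
```

If the input on the RH box contained bytes such as 0xA0, a class like [^"\s] would stop at them once STDIN was implicitly UTF-8, while [^ "] would not.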


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
