Re: Pattern Match

Rob Dixon Tue, 09 Dec 2003 14:20:42 -0800

Tom Kinzer wrote:
>
> Rob Dixon wrote:
> >
> > Eric Sand wrote:
> > >
> > >     I am very new to Perl, but I sense a great adventure ahead after just
> > > programming with Cobol, Pascal, and C over the last umpteen years. I have
> > > written a perl script where I am trying to detect a non-printing
> > > character (Ctrl@ - Ctrl_) and then substitute a printing ASCII sequence such
> > > as "^@" in its place, but it does not seem to work as I would like. Any
> > > advice would be greatly appreciated.
> > >
> > >          Thank You....Eric Sand
> > >
> > >
> >
> > Your obvious guess is to write Perl as if it were C. That's slightly better
> > than treating it as a scripting language, but there are many joys left to be
> > found!
> >
> > > $in_ctr=0;
> > > $out_ctr=0;
> > >
> > > while ($line = <STDIN>)
> > >     {
> > >     chomp($line);
> > >     $in_ctr ++;
> > >     if ($line = s/\c@,\cA,\cB,\cC,\cD,\cE,\cF,\cG,\cH,\cI,\cJ,\cK,
> > >                   \cL,\cM,\cN,\cO,\cP,\cQ,\cR,\cS,\cT,\cU,\cV,\cW,
> > >                   \cX,\cY,\cZ,\c[,\c\,\c],\c^,\c_
> > >                  /^@,^A,^B,^C,^D,^E,^F,^G,^H,^I,^J,^K,
> > >                   ^L,^N,^N,^O,^P,^Q,^R,^S,^T,^U,^V,^W,
> > >                   ^X,^Y,^Z,^[,^\,^],^^,^_/)
> > >         {
> > >         $out_ctr ++;
> > >         printf("Non-printing chars detected in: %s\n",$line);
> > >         }
> > >     }
> > > printf("Total records read                                 = %d\n",$in_ctr);
> > > printf("Total records written with non-printing characters = %d\n",$out_ctr);
> >
> > I would write this as below. The first thing is to *always*
> >
> >   use strict;
> >   use warnings;
> >
> >
> > after which you have to declare all of your variables with 'my'.
> >
> > The second is to get used to using the default $_ variable which
> > is set to the value for the current 'while(<>)' or 'for' loop
> > iteration, and is a default parameter for most built-in functions.
> >
> > Finally, in your particular case you're using the s/// (substitute)
> > operator wrongly. The first part, s/here//, is a regular expression,
> > not a list of characters. You'll need to read up on these at
> >
> >   perldoc perlre
> >
> > The second part, s//here/, is a string expression which can use
> > 'captured' sequences (anything in brackets) from the first part
> > and, with the addition of the s///e (executable) qualifier can
> > also be an executable statement. Here I've used it to add 0x40
> > to the ASCII value of the control character grabbed by the regex.
> >
> > A lot of this won't make sense until you learn some more, but I
> > hope you'll agree that this code is cuter than your original?
> >
> > HTH,
> >
> > Rob
> >
> >
> >
> > use strict;
> > use warnings;
> >
> > my $in_ctr = 0;
> > my $out_ctr = 0;
> >
> > while (<>) {
> >
> >   chomp;
> >
> >   $in_ctr++;
> >
> >   if (s/([\x00-\x1F])/'^'.chr(ord($1) + 0x40)/eg) {
> >     $out_ctr++;
> >     printf "Non-printing chars detected in: %s\n", $_;
> >   }
> > }
> >
> > printf "Total records read                                 = %d\n", $in_ctr;
> > printf "Total records written with non-printing characters = %d\n",
> > $out_ctr;
>
> Rob, can you explain the details of that replace?  That's pretty slick.  I
> see you're adding the hex value to get to the appropriate ASCII value, but
> didn't know you could do some of that gyration inside a regex.


I didn't think it was slick at all. In fact I was disappointed that it looked
such a mess, but I don't see a better way. Anyway, the statement is

  s/([\x00-\x1F])/'^'.chr(ord($1) + 0x40)/eg

where the regex is

  ([\x00-\x1F])

The enclosing parentheses capture whatever the pattern inside them matched
as $1, for use later in the replacement expression or even in a later
statement. Within them is a character class [ .. ] which matches any single
control character: the first 'column' of the 7-bit 128-character ASCII set,
with byte values 0 through 31 or 0x00 through 0x1F. It would be better
expressed as

  [[:cntrl:]]

which describes what you /mean/ rather than how your machine should do it.
It is not quite identical, though: [[:cntrl:]] also matches DEL (0x7F),
which the + 0x40 arithmetic below does not turn into anything sensible.
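
The two classes can be compared directly. This is only a minimal sketch,
assuming plain 7-bit ASCII input and the explicit range written as
[\x00-\x1F]; the variable names are made up for the illustration. It lists
which byte values each class matches and prints any that only the POSIX
class picks up.

```perl
use strict;
use warnings;

# Collect the byte values (0..127) matched by each character class.
my @by_range = grep { chr($_) =~ /[\x00-\x1F]/ } 0 .. 127;
my @by_posix = grep { chr($_) =~ /[[:cntrl:]]/ } 0 .. 127;

printf "explicit range matches %d characters\n", scalar @by_range;
printf "[[:cntrl:]] matches    %d characters\n", scalar @by_posix;

# Anything matched by the POSIX class but not by the explicit range.
my %in_range = map { $_ => 1 } @by_range;
my @extra    = grep { !$in_range{$_} } @by_posix;
printf "only in [[:cntrl:]]: 0x%02X\n", $_ for @extra;
```

If DEL can turn up in the input, it probably needs its own handling rather
than the same + 0x40 shift.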

OK, so we've captured one control character into $1. Then comes the
replacement string, which can be an executable expression with the /e
modifier on the substitution. Note that for simple interpolation of
variables like the captured $1, $2 etc, and in fact any variable
(including arrays and hashes) in scope, there is no need for /e. It is
only necessary if there are operators or subroutines that need to
be executed to build the replacement string.
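
The difference is easiest to see side by side. This is a small sketch with
made-up sample strings, not part of the original script: the first
substitution needs no /e because the replacement only interpolates $1 and
$2, the second uses /e so that the replacement is evaluated as Perl code,
and the third shows what the same replacement does when /e is left off.

```perl
use strict;
use warnings;

my $plain = "key=value";
# No /e: the replacement is an interpolated string, so $1 and $2 just work.
(my $swapped = $plain) =~ s/(\w+)=(\w+)/$2=$1/;
print "$swapped\n";    # value=key

my $sum = "3+4";
# With /e: the replacement is Perl code whose result becomes the new text.
(my $added = $sum) =~ s/(\d+)\+(\d+)/$1 + $2/e;
print "$added\n";      # 7

# Without /e the same replacement is only a string; the "+" is copied as is.
(my $literal = $sum) =~ s/(\d+)\+(\d+)/$1 + $2/;
print "$literal\n";    # 3 + 4
```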

It's a mess because there is no way of relating control characters
(e.g. CR) with their alphabetic equivalents (e.g. CTRL/M) without
doing character arithmetic. And that's not what characters do in
/real/ life.

In

  '^'.chr(ord($1) + 0x40)

ord($1) returns the byte value of the control character.

  + 0x40

moves that byte value from the first column (control characters) to the
third column (capital alphas, plus @ and a few punctuation characters).

  chr()

turns that byte value back into a one-character ASCII string.

  '^'.

prepends a caret to that character. Hence "\cM" becomes
'^M'.
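
The same steps can be run one at a time outside the substitution. This is
just an illustrative sketch; the intermediate variable names are invented
here so that each stage can be printed.

```perl
use strict;
use warnings;

my $ctrl = "\cM";                 # carriage return, byte value 0x0D

my $code    = ord($ctrl);         # 13 (0x0D)
my $shifted = $code + 0x40;       # 77 (0x4D)
my $letter  = chr($shifted);      # "M"
my $visible = '^' . $letter;      # "^M"

printf "0x%02X -> 0x%02X -> %s\n", $code, $shifted, $visible;

# The same arithmetic applied to a few other control characters.
for my $c ("\x00", "\cA", "\cI", "\x1B", "\x1F") {
    printf "0x%02X => %s\n", ord($c), '^' . chr(ord($c) + 0x40);
}
```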

All that is left is the /g modifier, which simply replaces
all instances of the regex instead of just the first one found.
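
A quick way to see the effect of /g, sketched on a made-up record that
contains a tab and a carriage return, with the range written as
[\x00-\x1F]:

```perl
use strict;
use warnings;

my $record = "one\ttwo\rthree";

# Without /g only the first control character (the tab) is replaced;
# the carriage return is still in the string as a raw byte.
(my $first_only = $record) =~ s/([\x00-\x1F])/'^' . chr(ord($1) + 0x40)/e;

# With /g every control character is replaced, and in scalar context
# s///g returns the number of substitutions made.
my $all   = $record;
my $count = ($all =~ s/([\x00-\x1F])/'^' . chr(ord($1) + 0x40)/eg);

print "$all\n";      # one^Itwo^Mthree
print "$count\n";    # 2
```

That return value is also why the substitution can sit directly inside the
if (...) test: it is false when nothing was replaced.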

I hope this helps. It's useful for me to tie down my
programming to first principles once in a while and ask
/why/ did I write that?

Cheers guys.

Rob


