Thanks, Paul.  A very thoughtful response--I will try this out (I don't
recall ever encountering the ?? operator, but if it works as advertised I
will likely use it a lot).

--Marc

On Thu, Apr 21, 2011 at 3:29 PM, Paul Johnson <p...@pjcj.net> wrote:

> On Thu, Apr 21, 2011 at 01:42:42PM -0400, Marc Perry wrote:
> > Hi,
> >
> > I was parsing a collection of HTML files where I wanted to extract a certain
> > block from each file, like this:
>
> This is where everyone will tell you to use some dedicated HTML parsing
> module.
>
> > > ./script.pl *.html
> >
> > my $accumulator;
> > my $capture_counter;
> >
> > while ( <> ) {
> >     if ( /<h1>/.../labelsub/ ) {
> >         $accumulator .= $_ unless /labelsub/;
> >         if ( /labelsub/ && !$capture_counter ) {
> >             print $accumulator;
> >             $capture_counter = 1;
> >         }
> >         else {
> >             next;
> >         }
> >     }
> >     else {
> >         next;
> >     }
> > }
> > continue { # flush out the variables and clean up
> >     if ( eof ) {
> >         close ARGV;
> >         $accumulator = '';
> >         $capture_counter = '';
> >     }
> > }
> >
> > The bit about the $capture_counter is because some of the files have
> > multiple blocks of text that could be accumulated, and I only want the first
> > block in the file.
> >
> > This usually works fine, until I encountered an input file that did not
> > contain the string 'labelsub' after the first '<h1>' regex pattern match.
> > Then the conditional if test continued to search in the incoming lines in
> > the next file (because I am processing a whole batch using the while (<>)
> > operator), which it eventually found, and then printed nothing, because at
> > the end-of-file of the previous file, the script flushed the contents of the
> > accumulator.
> >
> > One solution is to just run the same script individually on each file, but I
> > was wondering if there was a way to reset the 'state' of the range operator
> > pattern match at the end of the physical file (or at any other time for that
> > matter)?
>
> No, there isn't (unless you want to get fancy and use a closure or
> something) and so you'll need to find some other way to "end" the range.
> The obvious other end point is the end of file, and so you can have your
> range operator as:
>
>    if ( /<h1>/ ... /labelsub/ || eof ) {
>
> This will ensure that the range operator "ends" by the end of each file,
> but you'd need to do extra work because of the logic of the rest of your
> program.  So let's see if we can do something about that.
>
> Whilst it doesn't make a difference to the logic, I prefer to jump out
> of a loop early if I find it doesn't satisfy the conditions I'm looking
> for.  So I think that:
>
>    next unless /<h1>/ .. /labelsub/ || eof;
>
> looks tidier than the if else conditional.
>
> Then there's your logic to ensure you only count the first block in each
> file.  Perl has the little-known ?? counterpart to // which will only
> match once.  So making that line:
>
>    next unless ?<h1>? .. /labelsub/ || eof;
>
> allows you to get rid of the $capture_counter variable.  But you'll need
> to add a reset to the continue block, to reset the ?? at the start of a
> new file.
>
> Finally, with this change you may as well just print $accumulator in the
> continue block too.  So we end up with:
>
>    my $accumulator;
>
>    while ( <> ) {
>        next unless ?<h1>? .. /labelsub/ || eof;
>        $accumulator .= $_ unless /labelsub/;
>    }
>    continue { # flush out the variables and clean up
>        if ( eof ) {
>            print $accumulator;
>            $accumulator = '';
>            reset;
>        }
>    }
>
> which, I think, does what you are after.
>
> The docs mention that ?? is vaguely deprecated:
>
>    This usage is vaguely deprecated, which means it just might possibly
>    be removed in some distant future version of Perl, perhaps somewhere
>    around the year 2168.
>
> That doesn't sound too bad, but there was some talk of an earlier
> deprecation of the bare ?? syntax, so it might be safer to use m??
> instead.
>
> Interestingly (for me), this is the first time in over 20 years that I
> have found a legitimate use for ??, and the associated reset.
>
> --
> Paul Johnson - p...@pjcj.net
> http://www.pjcj.net
>
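
For anyone trying the approach quoted above, here is a minimal, self-contained
sketch of it.  It uses the m?? spelling that Paul suggests as the safer one (as
far as I know, the bare ?PATTERN? form was removed in Perl 5.22).  The sample
contents, the temp files, and the first_blocks() name are all hypothetical and
exist only so the loop can run without real HTML files.  Blocks are collected in
an array rather than printed directly, which is a small departure from the
quoted code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample inputs: the first file never closes its block with
# 'labelsub'; the second holds two complete blocks.
my @samples = (
    "intro\n<h1>A</h1>\nalpha\n",
    "<h1>B</h1>\nbeta\nlabelsub\n<h1>C</h1>\ngamma\nlabelsub\n",
);

# Write each sample to a temporary file so that <> has something to read.
my @files;
for my $content (@samples) {
    my ( $fh, $name ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    print {$fh} $content;
    close $fh or die "close failed: $!";
    push @files, $name;
}

# Return the first <h1> .. labelsub block of each file, one entry per file.
sub first_blocks {
    local @ARGV = @_;    # <> reads the files named in @ARGV
    my @blocks;
    my $accumulator = '';

    while ( <> ) {
        # m?? matches only once until reset, so only the first <h1> in a
        # file can open the range; 'labelsub' or end of file closes it.
        next unless m?<h1>? .. ( /labelsub/ || eof );
        $accumulator .= $_ unless /labelsub/;
    }
    continue {
        if ( eof ) {     # end of the current file
            push @blocks, $accumulator;
            $accumulator = '';
            reset;       # re-arm m?? for the next file
        }
    }
    return @blocks;
}

my @blocks = first_blocks(@files);
print "--- file ", $_ + 1, " ---\n", $blocks[$_] for 0 .. $#blocks;
```

Because eof also closes the range, the unterminated block in the first sample
ends with that file and does not run on into the second one.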
