Thanks, Paul. A very thoughtful response--I will try this out (I don't recall ever encountering the ?? operator, but if it works as advertised I will likely use it a lot).
--Marc

On Thu, Apr 21, 2011 at 3:29 PM, Paul Johnson <p...@pjcj.net> wrote:

> On Thu, Apr 21, 2011 at 01:42:42PM -0400, Marc Perry wrote:
> > Hi,
> >
> > I was parsing a collection of HTML files where I wanted to extract a
> > certain block from each file, like this:
>
> This is where everyone will tell you to use some dedicated HTML parsing
> module.
>
> > ./script.pl *.html
> >
> > my $accumulator;
> > my $capture_counter;
> >
> > while ( <> ) {
> >     if ( /<h1>/.../labelsub/ ) {
> >         $accumulator .= $_ unless /labelsub/;
> >         if ( /labelsub/ && !$capture_counter ) {
> >             print $accumulator;
> >             $capture_counter = 1;
> >         }
> >         else {
> >             next;
> >         }
> >     }
> >     else {
> >         next;
> >     }
> > }
> > continue { # flush out the variables and clean up
> >     if ( eof ) {
> >         close ARGV;
> >         $accumulator = '';
> >         $capture_counter = '';
> >     }
> > }
> >
> > The bit about the $capture_counter is because some of the files have
> > multiple blocks of text that could be accumulated, and I only want the
> > first block in the file.
> >
> > This usually works fine, until I encountered an input file that did not
> > contain the string 'labelsub' after the first '<h1>' regex pattern match.
> > Then the conditional if test continued to search in the incoming lines in
> > the next file (because I am processing a whole batch using the while (<>)
> > operator), which it eventually found, and then printed nothing, because
> > at the end-of-file of the previous file, the script flushed the contents
> > of the accumulator.
> >
> > One solution is to just run the same script individually on each file,
> > but I was wondering if there was a way to reset the 'state' of the range
> > operator pattern match at the end of the physical file (or at any other
> > time for that matter)?
>
> No, there isn't (unless you want to get fancy and use a closure or
> something) and so you'll need to find some other way to "end" the range.
> The obvious other end point is the end of file, and so you can have your
> range operator as:
>
>     if ( /<h1>/ ... /labelsub/ || eof ) {
>
> This will ensure that the range operator "ends" by the end of each file,
> but you'd need to do extra work because of the logic of the rest of your
> program.  So let's see if we can do something about that.
>
> Whilst it doesn't make a difference to the logic, I prefer to jump out
> of a loop early if I find it doesn't satisfy the conditions I'm looking
> for.  So I think that:
>
>     next unless /<h1>/ .. /labelsub/ || eof;
>
> looks tidier than the if else conditional.
>
> Then there's your logic to ensure you only count the first block in each
> file.  Perl has the little-known ?? counterpart to // which will only
> match once.  So making that line:
>
>     next unless ?<h1>? .. /labelsub/ || eof;
>
> Allows you to get rid of the $capture_counter variable.  But you'll need
> to add a reset to the continue block, to reset the ?? at the start of a
> new file.
>
> Finally, with this change you may as well just print $accumulator in the
> continue block too.  So we end up with
>
>     my $accumulator;
>
>     while ( <> ) {
>         next unless ?<h1>? .. /labelsub/ || eof;
>         $accumulator .= $_ unless /labelsub/;
>     }
>     continue { # flush out the variables and clean up
>         if ( eof ) {
>             print $accumulator;
>             $accumulator = '';
>             reset;
>         }
>     }
>
> which, I think, does what you are after.
>
> The docs mention that ?? is vaguely deprecated:
>
>     This usage is vaguely deprecated, which means it just might possibly
>     be removed in some distant future version of Perl, perhaps somewhere
>     around the year 2168.
>
> That doesn't sound too bad, but there was some talk of an earlier
> deprecation of the bare ?? syntax, so it might be safer to use m??
> instead.
>
> Interestingly (for me), this is the first time in over 20 years that I
> have found a legitimate use for ??, and the associated reset.
>
> --
> Paul Johnson - p...@pjcj.net
> http://www.pjcj.net
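
For anyone reading this thread later: the closing suggestion in the quoted reply (use m?? rather than the bare ?? form) might look roughly like the sketch below. It is only an illustration of that suggestion, not code posted by either author. The wrapper sub first_blocks and the choice to return a string instead of printing are assumptions made here so the loop can be called on a list of files; the loop body itself follows the quoted version.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the first
# <h1> .. labelsub block of each named file, concatenated.
sub first_blocks {
    local @ARGV = @_;    # let <> iterate over the files passed in
    local $_;            # do not clobber the caller's $_
    my $output      = '';
    my $accumulator = '';

    while ( <> ) {
        # m?<h1>? matches only once until reset is called, so only the
        # first <h1> in a file can open the range.  "|| eof" closes the
        # range at the end of each file even when labelsub never appears.
        next unless m?<h1>? .. /labelsub/ || eof;
        $accumulator .= $_ unless /labelsub/;
    }
    continue {
        if ( eof ) {     # end of the current file
            $output .= $accumulator;
            $accumulator = '';
            reset;       # re-arm m?<h1>? for the next file
        }
    }
    return $output;
}

# Usage (assumed invocation): ./script.pl *.html
print first_blocks(@ARGV) if @ARGV;
```

As in the quoted version, a file with an <h1> but no labelsub contributes everything from the first <h1> to the end of that file, because eof is what closes the range there.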