Sorry, replying to myself, but I just stumbled across a similar
situation and my solution might help you too.

I needed to define a block like this:

perl until FLAG
   PERL
FLAG;

which is like a 'here-doc' for inlining perl in another language, without
actually having to parse the perl code. Like your input, I need to match
'anything' up to the closing flag. I ended up using a rule similar to your
original solution, except that instead of a standalone /.*?/ match I
combined it with the terminal that follows it (the opening delimiter). After
playing around a bit, I came up with the following test script, which parses
out all the valid chunks between 'START' and 'END' amongst other rubbish in
the input in one pass:

====== START CODE ======

use Parse::RecDescent;
use Data::Dumper;
#$::RD_TRACE = 1;

# assuming start/end delimiters of START and END
my $grammar = <<'STOP';

start:
    chunk(s?)

chunk:
    /.*?START/s command(s) 'END'  # This is the important bit
    {$item[2]}

command:
    'test' ';'
    {"TEST COMMAND"}

STOP

my $text = << 'STOP';

blah blah blayh

asdsd kjkl

START
  test;
  test;
END

kjsaljdlk
askd

START
  test;
END

sad
asdgfdsf
gfsfg

STOP

my $res = Parse::RecDescent->new($grammar)->start($text);
print Data::Dumper::Dumper($res), "\n";

====== END CODE ======

Note that the /s modifier on the 'garbage scooping' regexes is important for
this to work -- as far as I can tell, Parse::RecDescent anchors each token
regex at the current parse position, so without /s the minimal match can
never cross the newlines in front of the next START. I was scratching my
head over that for a bit :)
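
A quick standalone demo of that anchored behaviour (plain Perl, not part of
the script above):

# The scooping regex has to cross the newlines itself, since it is
# anchored at the current position:
my $remaining = "blah blah\n\nSTART\n  test;\nEND\n";
print "without /s: ", ($remaining =~ /\A.*?START/  ? "match" : "no match"), "\n";  # no match
print "with /s:    ", ($remaining =~ /\A.*?START/s ? "match" : "no match"), "\n";  # match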

The output of that is:

====== START OUTPUT ======

$VAR1 = [
          [
            'TEST COMMAND',
            'TEST COMMAND'
          ],
          [
            'TEST COMMAND'
          ]
        ];

====== END OUTPUT ======
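
Applying the same shape to the 'perl until FLAG' block I mentioned at the
top would look something like this -- untested sketch only (rule names and
the test input are just placeholders), and it assumes the closing flag is
the literal word FLAG sitting at the start of its own line:

use Parse::RecDescent;
use Data::Dumper;

my $flag_grammar = <<'STOP';

start:
    perl_block(s?)

perl_block:
    /.*?perl\s+until\s+FLAG\b/s    # scoop rubbish up to and including the header
    /.*?(?=^FLAG;)/sm              # the inlined perl: anything up to the closing flag
    /FLAG;/                        # consume the closing flag itself
    {$item[2]}

STOP

my $src = "junk\nperl until FLAG\n  print \"hi\\n\";\nFLAG;\nmore junk\n";
my $blocks = Parse::RecDescent->new($flag_grammar)->start($src);
print Data::Dumper::Dumper($blocks), "\n";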

I haven't done any benchmarking, but this single-pass approach might be
faster than pulling the chunks out first and running sequential parses over
'clean' data. My original solution also anchored to the end of the input
with an eof marker and a 'trailing_guff' rule that matched anything after
the chunk(s?) subrule, but that turned out to be unnecessary.
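
For reference, that earlier version's start rule amounted to roughly this
(rest of the grammar as above):

start:
    chunk(s?) trailing_guff eof
    {$item[1]}

trailing_guff:
    /.*/s

eof:
    /\Z/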

MB

2009/9/4 Matthew Braid <mattyb...@gmail.com>:
> Hi all,
>
> Would there be some way of manipulating the skip re to do this?
>
> Something along the lines of:
>
> top: <skip: /NOT START DELIMITER/> chunk(s) eof
> chunk: delimiter_start <skip: /NORMAL SKIP/> command(s) delimiter_end
> eof: /\Z/
>
> The problem there is defining a skip that won't skip a
> delimiter_start. It also probably won't let delimiter_start appear
> anywhere _without_ marking the start of a set of commands.
>
> Not tested, but just a suggestion.
>
> MB
>
> 2009/9/4 Mike Diehl <mdi...@diehlnet.com>:
>> On Thursday 03 September 2009 01:50:58 Damian Conway wrote:
>>> Hi Mike,
>>>
>>> > What I've tried amounts to this:
>>> >
>>> > chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
>>>
>>> Unfortunately that won't work, because every regex in a PRD grammar is
>>> independent of the rest of the grammar, so even a minimal-matching .*?
>>> eats everything.
>>
>> Ya, that's what I was suspecting.  In hindsight, I should have figured that;
>> that's how I'd write it...
>>
>>> Is there some reason you can't use something like:
>>>
>>>     my $parser = Parse::RecDescent->new($grammar);
>>>
>>>     $text =~ s{<DELIMITER> (.*?) </DELIMITER>}
>>>             { $parser->parse($1); q{} }gexs;
>>
>> That's what I was doing, but it seems I misinterpreted my profiling results.
>> I found from profiling that the function I use to create (once) and run the
>> parser accounted for 80% of runtime.
>>
>> I assumed that since I only create the parser once (if !defined), creating 
>> the
>> parser wasn't where the cost was.  So I decided that it must be due to
>> actually running the parser, which might run several times during program
>> execution.  My conclusion was that I needed to rewrite the grammar so that
>> the parser would only run once.
>>
>> It sounds like I may need to go back to the old algorithm and start tuning 
>> the
>> grammar.
>>
>> --
>>
>> Take care and have fun,
>> Mike Diehl.
>>
>
