Hi Joe:

I think they are very good questions. I have pondered variations of that
kind of question for many years....

A few comments spring to mind, others may have far better answers:

- Greedy matching does seem to be most useful for parsing languages, but
that does not mean that non-greedy matching is not a good feature.

- PEG is very sparse on features and that is good. So the question boils
down to how much you value more features.

- As you point out in your second example, you can emulate a non-greedy match
with a greedy match using a not-predicate.  If you only had a non-greedy one,
you would have to use recursion to emulate a greedy match, which would be a
pity.  Both would be fine.
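To make that emulation concrete, here is a rough Python sketch (the helper
name is mine, not from any PEG tool) of the shape of your second example,
"<b>" (!"</b>" .)* "</b>":

```python
def match_between(text, pos, open_tag, close_tag):
    """PEG-style: open_tag (!close_tag .)* close_tag, starting at pos.
    Returns (body, end_pos) on success, or None. Never backtracks."""
    if not text.startswith(open_tag, pos):
        return None
    i = pos + len(open_tag)
    start = i
    # Greedy loop guarded by a not-predicate: consume any character
    # as long as close_tag does not start here.
    while i < len(text) and not text.startswith(close_tag, i):
        i += 1
    if not text.startswith(close_tag, i):
        return None  # ran off the end of input: no closing tag
    return text[start:i], i + len(close_tag)
```

Note the loop commits as it goes: once it stops at the first close_tag, the
match is decided, with no backtracking over alternatives.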

- Yes, PEG users seem mostly interested in larger language grammars, and in
that case it is very bad news to parse the whole document before backing off
to a short local match. For a short string match a simple but sloppy RE can
do this without a problem: many users don't notice and may not even know
what's happening (shudder).
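The sloppy-but-working RE behaviour your example turns on can be seen
directly with Python's re module:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .+ runs to the LAST </b>, swallowing everything in between.
greedy = re.search(r"<b>(.+)</b>", text).group(1)
print(greedy)  # one</b> and <b>two

# Lazy: .+? stops at the FIRST </b>.
lazy = re.search(r"<b>(.+?)</b>", text).group(1)
print(lazy)    # one
```

On a short string the greedy engine backtracks cheaply; on a whole document
that backing off from the end is exactly the bad news above.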

- Most terminal rules use matches that an RE can match, and grammar rules
could use an RE. Compiler folks often break the terminal matching out into a
lexer pass that generates a token stream.

- Most terminals want greedy matching without backtracking: a number is
never just some of its digits and a name is never just some of its
characters!  RE and classic CFG parsers will try all possibilities on
failure, which is silly, and that's another reason compiler folks use a
lexer pass first.
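A digit terminal makes the point: consume greedily, commit, and never retry
shorter alternatives. A quick Python sketch (hypothetical helper):

```python
def match_number(text, pos):
    """PEG-style terminal [0-9]+ : consume every digit starting at pos
    and commit to the result -- shorter matches are never retried."""
    i = pos
    while i < len(text) and text[i].isdigit():
        i += 1
    if i == pos:
        return None  # no digit here: the terminal fails outright
    return text[pos:i], i
```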

- PEG avoids that problem, so terminal matching and nested grammar elements
are easily integrated into one grammar: much better, I think.

- The !not predicate is rather tricky, and I think PEG got it right. It
looks a bit clumsy in the case you give, but it is very flexible. The EBNF
grammar rules for XML use a different form of negation -- but it is much
more cumbersome and not as flexible.

- BTW, using a . or _ to match ANY character is often very handy in an RE,
but it can be bad news in a grammar. In fact the XML grammar had errors for
this very reason. The problem arises with deciding what you consider ANY
character to include (Unicode even defines code points that are officially
not-a-character!).  A grammar rule can define an ANY class for your
particular grammar, and often almost-any rules for other cases.
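For example, XML 1.0 defines its own Char production rather than trusting a
bare ANY; a Python sketch of that class (the ranges as I recall them from
the spec: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]):

```python
def is_xml_char(c):
    """True if c falls in XML 1.0's Char production -- the grammar's own
    notion of ANY, which excludes most control characters and the
    noncharacters such as U+FFFE/U+FFFF."""
    cp = ord(c)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```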

- I've played with various ways to write a neater version of the use of
negation in your second question, but I've never been too happy with the
results, and I fall back to keeping it simple.... but a neater form would be
nice!

That's enough comments for now!  Too much fun.  Hope they helped.

Cheers,
Peter.


On Tue, Aug 9, 2011 at 11:49 AM, Joe Strout <j...@strout.net> wrote:

> I've only recently discovered PEG; I've read several papers about it
> (including Ford's seminal 2004 paper), and experimented with the Aurochs
> online demo to get a better feel for it.
>
> I'm interested mainly in applying PEG to text searching and manipulation,
> i.e., the sorts of things regular expressions are commonly used for.  It
> seems likely that this presents a different set of use cases than arise for
> the designers of programming languages (which seems to be the focus of most
> PEG work I've found so far).  These different use cases may be leading me to
> ask heretical questions:
>
> 1. Why should the repetition operators be greedy by default?  Ford's paper
> supports this with a simple assertion ("Longest-match parsing is almost
> always the desired behavior where options or repetition occur in practical
> machine-oriented languages"), but in my experience in the text-parsing
> world, greedy matching is almost always NOT what is wanted, and frequently
> leads new RE users into pitfalls (until they discover the lazy modifier).
>
> For example, a new RE user trying to match text between <b> and </b> tags
> in a document would try <b>(.+)</b>, which works fine as long as there is
> only one such tagged phrase.  But if there are two, then it will grab the
> opening tag of the first phrase, and the closing tag of the last one, with
> everything in between.  You have to use something like <b>(.+?)</b> to turn
> off greedy matching, and get the desired result.  Most text matching cases
> are like this, in my experience.
>
> So: what terrible things would result if PEG chose non-greedy matching by
> default instead?
>
> 2. Even more problematic is PEG's inability to backtrack on its greedy
> matching.  So a simple pattern such as "everything between <b> and </b>
> tags" can't be expressed like this:
>
>   "<b>" { .+ } "</b>"
>
> but must instead be expressed like this:
>
>   "<b>" { (~"</b>" .)+ } "</b>"
>
> Yet the equivalent RE, <b>(.+)</b>, works fine because the .* part will
> backtrack as needed to make the subsequent </b> match.  Is there any strong
> reason why PEG shouldn't do the same?
>
> (Indeed, it seems that with PEG as defined, the expressions ".+" or ".*"
> must never be useful, because they will always match everything to the end
> of the file.  Or have I missed something?)
>
> Please be patient with me if I'm asking stupid questions... there are
> probably very good reasons for these design decisions, and I'm eager to
> understand them.
>
> Thanks,
> - Joe
>
> _______________________________________________
> PEG mailing list
> PEG@lists.csail.mit.edu
> https://lists.csail.mit.edu/mailman/listinfo/peg
>