[PEG] comments & questions about greediness and backtracking

Joe Strout Mon, 08 Aug 2011 16:50:07 -0700

I've only recently discovered PEG; I've read several papers about it (including Ford's seminal 2004 paper), and experimented with the Aurochs online demo to get a better feel for it.

I'm interested mainly in applying PEG to text searching and manipulation, i.e., the sorts of things regular expressions are commonly used for. It seems likely that this presents a different set of use cases than arise for the designers of programming languages (which seems to be the focus of most PEG work I've found so far). These different use cases may be leading me to ask heretical questions:

1. Why should the repetition operators be greedy by default? Ford's paper supports this with a simple assertion ("Longest-match parsing is almost always the desired behavior where options or repetition occur in practical machine-oriented languages"), but in my experience in the text-parsing world, greedy matching is almost always NOT what is wanted, and frequently leads new RE users into pitfalls (until they discover the lazy modifier).

For example, a new RE user trying to match text between <b> and </b> tags in a document would try <b>(.+)</b>, which works fine as long as there is only one such tagged phrase. But if there are two, then it will grab the opening tag of the first phrase, and the closing tag of the last one, with everything in between. You have to use something like <b>(.+?)</b> to turn off greedy matching, and get the desired result. Most text matching cases are like this, in my experience.
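The greedy pitfall described above can be reproduced with a short sketch using Python's standard `re` module. The sample string is made up for illustration; it simply contains two phrases wrapped in <b> and </b> tags.

```python
import re

# Hypothetical sample text with two tagged phrases.
text = "one <b>first</b> two <b>second</b> three"

# Greedy: .+ runs to the end of the string, then backtracks
# only as far as the LAST closing tag.
greedy = re.findall(r"<b>(.+)</b>", text)

# Lazy: .+? extends one character at a time and stops at the
# FIRST closing tag it can reach.
lazy = re.findall(r"<b>(.+?)</b>", text)

print(greedy)  # ['first</b> two <b>second']
print(lazy)    # ['first', 'second']
```

With a single tagged phrase in the text, both patterns return the same result, which is why the problem tends to go unnoticed at first.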

So: what terrible things would result if PEG chose non-greedy matching by default instead?

2. Even more problematic is PEG's inability to backtrack on its greedy matching. So a simple pattern such as "everything between <b> and </b> tags" can't be expressed like this:


   "<b>" { .+ } "</b>"

but must instead be expressed like this:

   "<b>" { (~"</b>" .)+ } "</b>"

Yet the equivalent RE, <b>(.+)</b>, works fine because the .+ part will backtrack as needed to make the subsequent match. Is there any strong reason why PEG shouldn't do the same?

(Indeed, it seems that with PEG as defined, the expressions ".+" or ".*" must never be useful, because they will always match everything to the end of the file. Or have I missed something?)
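To make the behaviour in point 2 concrete, here is a minimal sketch of PEG-style matching written as Python combinators. It is not the Aurochs implementation or any real PEG library; the helper names (`lit`, `any_char`, `not_`, `seq`, `plus`) are hypothetical and chosen only for this illustration. Each parser takes the text and a position and returns the new position, or None on failure. The point is that `plus` commits to the longest run it can consume and is never asked to give characters back.

```python
def lit(s):
    """Match the literal string s at the current position."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def any_char():
    """Match any single character (the PEG '.')."""
    def parse(text, pos):
        return pos + 1 if pos < len(text) else None
    return parse

def not_(p):
    """Negative lookahead: succeed without consuming iff p fails."""
    def parse(text, pos):
        return pos if p(text, pos) is None else None
    return parse

def seq(*parsers):
    """Match each parser in order; fail if any of them fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def plus(p):
    """One or more repetitions, greedy, with no backtracking."""
    def parse(text, pos):
        pos = p(text, pos)
        if pos is None:
            return None
        while True:
            nxt = p(text, pos)
            if nxt is None or nxt == pos:  # stop on failure or no progress
                return pos
            pos = nxt
    return parse

text = "<b>bold</b> rest"

# "<b>" .+ "</b>" : the .+ eats everything up to end of input,
# so the closing literal can never match.
naive = seq(lit("<b>"), plus(any_char()), lit("</b>"))

# "<b>" (~"</b>" .)+ "</b>" : the lookahead stops the repetition
# just before the closing tag.
guarded = seq(lit("<b>"),
              plus(seq(not_(lit("</b>")), any_char())),
              lit("</b>"))

print(naive(text, 0))    # None
print(guarded(text, 0))  # 11
```

In this sketch the guarded pattern stops at position 11, just after the closing tag, while the naive one fails outright rather than retrying with a shorter repetition.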

Please be patient with me if I'm asking stupid questions... there are probably very good reasons for these design decisions, and I'm eager to understand them.


Thanks,
- Joe

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
