I've only recently discovered PEG; I've read several papers about it (including Ford's seminal 2004 paper), and experimented with the Aurochs online demo to get a better feel for it.

I'm interested mainly in applying PEG to text searching and manipulation, i.e., the sorts of things regular expressions are commonly used for. It seems likely that this presents a different set of use cases than arise for the designers of programming languages (which seems to be the focus of most PEG work I've found so far). These different use cases may be leading me to ask heretical questions:

1. Why should the repetition operators be greedy by default? Ford's paper supports this with a simple assertion ("Longest-match parsing is almost always the desired behavior where options or repetition occur in practical machine-oriented languages"), but in my experience in the text-parsing world, greedy matchi is almost always NOT what is wanted, and frequently leads new RE users into pitfalls (until they discover the lazy modifier).

For example, a new RE user trying to match text between <b> and </b> tags in a document would try <b>(.+)</b>, which works fine as long as there is only one such tagged phrase. But if there are two, then it will grab the opening tag of the first phrase, and the closing tag of the last one, with everything in between. You have to use something like <b>(.+?)</b> to turn off greedy matching, and get the desired result. Most test matching cases are like this, in my experience.

So: what terrible things would result if PEG chose non-greedy matching by default instead?

2. Even more problematic is PEG's inability to backtrack on its greedy matching. So a simple pattern such as "everything between <b> and </b> tags" can't be expressed like this:

   "<b>" { .+ } "</b>"

but must instead be expressed like this:

   "<b>" { (~"</b>" .)+ } "</b>"

Yet the equivalent RE, <b>(.+)</b>, works fine because the .* part will backtrack as needed to make the subsequent </b> match. Is there any strong reason why PEG shouldn't do the same?

(Indeed, it seems that with PEG as defined, the expressions ".+" or ".*" must never be useful, because they will always match everything to the end of the file. Or have I missed something?)

Please be patient with me if I'm asking stupid questions... there are probably very good reasons for these design decisions, and I'm eager to understand them.

Thanks,
- Joe

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to