Dear Joe,

You are correct in assessing that .+ and .* expressions will probably never be useful. PEGs are completely different from regular expressions -- don't be misled by the syntactic similarities.
To answer your first question, "Why should the repetition operators be greedy by default?": because there is no backtracking*. If repetition were lazy instead, X* would always match the empty string and X+ would always match exactly one X, because nothing later in the pattern could ever force them to consume more.

To answer your second question, "Is there any strong reason why PEG shouldn't [backtrack on its greedy matching]?": yes. Every parsing expression in a PEG has the following property: it has exactly one possible match at any given position in the input. This property is what enables packrat parsing, which memoizes that single result per position to parse in linear time. If you add backtracking*, you throw this property away.

Cheers,
- Francisco Mota

* When I say there is no backtracking, I generally mean that the property above holds: you cannot go back and change the match of a given parsing expression on the same input. In my opinion, the prioritized choice operator does not give you backtracking (as many say or seem to imply)... you are just looking ahead, really.
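To make the single-match property concrete, here is a minimal sketch of that model in Python. The encoding -- a parsing expression as a function from a position to either the position after its one match, or None on failure -- is just an illustrative choice of mine, not any particular PEG library's API:

    # Each combinator returns a matcher: given the input text and a
    # position, it returns exactly one result -- the position after the
    # match, or None on failure. There is never a "second answer" to go
    # back for; that is the no-backtracking property.

    def literal(s):
        def match(text, pos):
            return pos + len(s) if text.startswith(s, pos) else None
        return match

    def star(e):
        # Greedy by necessity: repeat e until it fails, then stop.
        # A lazy star with no backtracking would always stop immediately
        # and match the empty string.
        def match(text, pos):
            while True:
                nxt = e(text, pos)
                if nxt is None or nxt == pos:  # failure, or no progress
                    return pos                 # the one and only match
                pos = nxt
        return match

    def choice(e1, e2):
        # Prioritized choice: commit to e1 if it succeeds; try e2 only
        # on failure. Lookahead, not backtracking -- e1's successful
        # match is never revised.
        def match(text, pos):
            r = e1(text, pos)
            return r if r is not None else e2(text, pos)
        return match

    def packrat(e):
        # Memoize the single result per position (this sketch assumes
        # one input text per matcher instance). Caching is sound only
        # because each expression has exactly one possible match at a
        # given position; that is what makes packrat parsing linear-time.
        cache = {}
        def match(text, pos):
            if pos not in cache:
                cache[pos] = e(text, pos)
            return cache[pos]
        return match

    xs = packrat(star(literal("x")))
    print(xs("xxxy", 0))  # 3 -- greedy; a lazy star would yield 0 here

Add backtracking and the cache becomes unsound, because the "right" match for an expression would then depend on what follows it.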
On Tue, Aug 9, 2011 at 12:49 AM, Joe Strout <j...@strout.net> wrote:

> I've only recently discovered PEG; I've read several papers about it
> (including Ford's seminal 2004 paper), and experimented with the Aurochs
> online demo to get a better feel for it.
>
> I'm interested mainly in applying PEG to text searching and manipulation,
> i.e., the sorts of things regular expressions are commonly used for. It
> seems likely that this presents a different set of use cases than arise for
> the designers of programming languages (which seems to be the focus of most
> PEG work I've found so far). These different use cases may be leading me to
> ask heretical questions:
>
> 1. Why should the repetition operators be greedy by default? Ford's paper
> supports this with a simple assertion ("Longest-match parsing is almost
> always the desired behavior where options or repetition occur in practical
> machine-oriented languages"), but in my experience in the text-parsing
> world, greedy matching is almost always NOT what is wanted, and frequently
> leads new RE users into pitfalls (until they discover the lazy modifier).
>
> For example, a new RE user trying to match text between <b> and </b> tags
> in a document would try <b>(.+)</b>, which works fine as long as there is
> only one such tagged phrase. But if there are two, then it will grab the
> opening tag of the first phrase, and the closing tag of the last one, with
> everything in between. You have to use something like <b>(.+?)</b> to turn
> off greedy matching and get the desired result. Most text matching cases
> are like this, in my experience.
>
> So: what terrible things would result if PEG chose non-greedy matching by
> default instead?
>
> 2. Even more problematic is PEG's inability to backtrack on its greedy
> matching. So a simple pattern such as "everything between <b> and </b>
> tags" can't be expressed like this:
>
> "<b>" { .+ } "</b>"
>
> but must instead be expressed like this:
>
> "<b>" { (~"</b>" .)+ } "</b>"
>
> Yet the equivalent RE, <b>(.+)</b>, works fine because the .+ part will
> backtrack as needed to make the subsequent </b> match. Is there any strong
> reason why PEG shouldn't do the same?
>
> (Indeed, it seems that with PEG as defined, the expressions ".+" or ".*"
> must never be useful, because they will always match everything to the end
> of the file. Or have I missed something?)
>
> Please be patient with me if I'm asking stupid questions... there are
> probably very good reasons for these design decisions, and I'm eager to
> understand them.
>
> Thanks,
> - Joe
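Joe's greedy-versus-lazy contrast is easy to reproduce with Python's standard re module; the snippet below merely illustrates the behavior he describes, on invented sample text:

    import re

    text = "one <b>first</b> two <b>second</b> three"

    # Greedy: .+ runs to the end of the string, then backtracks just far
    # enough for </b> to match -- i.e., at the *last* closing tag.
    print(re.search(r"<b>(.+)</b>", text).group(1))
    # prints: first</b> two <b>second

    # Lazy: .+? consumes as little as possible, stopping at the *first*
    # closing tag, which is usually what was wanted.
    print(re.search(r"<b>(.+?)</b>", text).group(1))
    # prints: first

A PEG achieves the same first-closing-tag result with the negative-lookahead idiom quoted above, "<b>" { (~"</b>" .)+ } "</b>", without ever revisiting a decision.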