I've only recently discovered PEG; I've read several papers about it
(including Ford's seminal 2004 paper), and experimented with the Aurochs
online demo to get a better feel for it.
I'm interested mainly in applying PEG to text searching and
manipulation, i.e., the sorts of things regular expressions are commonly
used for. It seems likely that this presents a different set of use
cases than arise for the designers of programming languages (which seems
to be the focus of most PEG work I've found so far). These different
use cases may be leading me to ask heretical questions:
1. Why should the repetition operators be greedy by default? Ford's
paper supports this with a simple assertion ("Longest-match parsing is
almost always the desired behavior where options or repetition occur in
practical machine-oriented languages"), but in my experience in the
text-parsing world, greedy matching is almost always NOT what is wanted,
and frequently leads new RE users into pitfalls (until they discover the
lazy modifier).
For example, a new RE user trying to match text between <b> and </b>
tags in a document would try <b>(.+)</b>, which works fine as long as
there is only one such tagged phrase. But if there are two, then it
will grab the opening tag of the first phrase, and the closing tag of
the last one, with everything in between. You have to use something
like <b>(.+?)</b> to turn off greedy matching, and get the desired
result. Most text matching cases are like this, in my experience.
So: what terrible things would result if PEG chose non-greedy matching
by default instead?
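To make the <b> example concrete, here is a minimal sketch using Python's
re module. The sample string is made up for illustration; other RE
engines with a lazy modifier should behave the same way.

```python
import re

# Made-up sample document with two tagged phrases.
text = "<b>one</b> and <b>two</b>"

# Greedy: .+ runs to the end of the text, then backs off only as far as
# the LAST </b>, so both phrases end up in a single match.
greedy = re.findall(r"<b>(.+)</b>", text)

# Lazy: .+? stops at the FIRST </b> it can reach, giving one match per phrase.
lazy = re.findall(r"<b>(.+?)</b>", text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```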
2. Even more problematic is PEG's inability to backtrack on its greedy
matching. So a simple pattern such as "everything between <b> and </b>
tags" can't be expressed like this:
"<b>" { .+ } "</b>"
but must instead be expressed like this:
"<b>" { (~"</b>" .)+ } "</b>"
Yet the equivalent RE, <b>(.+)</b>, works fine because the .+ part will
backtrack as needed to make the subsequent </b> match. Is there any
strong reason why PEG shouldn't do the same?
(Indeed, it seems that with PEG as defined, the expressions ".+" or ".*"
must never be useful, because they will always match everything to the
end of the file. Or have I missed something?)
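Here is how I understand the behavior, as a rough Python sketch of
PEG-style matching. This is my own toy code, not Aurochs or any real PEG
library: the helper names (lit, any_char, seq, plus, not_) are invented
for illustration, and not_ plays the role of the ~ predicate in the
pattern above.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

def lit(s: str) -> Parser:
    """Match the literal string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def any_char(text: str, pos: int) -> Optional[int]:
    """Match any single character (the '.' expression)."""
    return pos + 1 if pos < len(text) else None

def seq(*parsers: Parser) -> Parser:
    """Match each parser in order; fail if any one fails."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def plus(p: Parser) -> Parser:
    """One-or-more, greedy, and never gives back what it consumed."""
    def parse(text: str, pos: int) -> Optional[int]:
        pos = p(text, pos)
        if pos is None:
            return None
        while True:
            nxt = p(text, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return parse

def not_(p: Parser) -> Parser:
    """Negative lookahead: succeed without consuming iff p fails here."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos if p(text, pos) is None else None
    return parse

text = "<b>one</b> and <b>two</b>"

# "<b>" .+ "</b>" : the .+ eats everything, so the closing tag can never match.
naive = seq(lit("<b>"), plus(any_char), lit("</b>"))

# "<b>" (~"</b>" .)+ "</b>" : the lookahead stops the repetition at the first </b>.
guarded = seq(lit("<b>"), plus(seq(not_(lit("</b>")), any_char)), lit("</b>"))

naive_result = naive(text, 0)
guarded_result = guarded(text, 0)
print(naive_result)    # None
print(guarded_result)  # 10
```

If this sketch is faithful to PEG as defined, the first pattern fails
outright rather than matching too much, while the second stops right
after the first closing tag (position 10).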
Please be patient with me if I'm asking stupid questions... there are
probably very good reasons for these design decisions, and I'm eager to
understand them.
Thanks,
- Joe
_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg