Grammar Rules For Inset Paragraphs
The Haskell rules look quite complex and there are lots of issues that I will
ignore, but I hope the essence of the following approach may be helpful to you.
A grammar can specify a simple inset paragraph if it has the ability to specify
that the inset of a line matches the `same` inset as the first line of the
But this ability to specify the `same` match as a previous production is not
possible with a CFG or a PEG. There have been various grammar language
extension mechanisms and work-arounds used to cope with this problem.
The same problem appears in an even more obvious way in the XML grammar where
the end tag name must match the `same` name as the start tag. The XML grammar
simply leaves this out of the grammar rules, it is a restriction that is pushed
out as a semantic check on the parse tree.
I have been using a grammar with an operator to extend a PEG or CFG with the
ability to match the `same` value as a previous match. The prefix operator
notation @x means that the rule must match the `same` value that the previous x
rule matched (I'll define this more fully shortly).
For example, a simple inset paragraph can be defined as:
para = inset line (eol @inset line)*
inset = (space | tab)+
line = char+
I have used a generic BNF grammar rule sybtax here, I'll only use mutually
exclusive choices in the examples here, so you can translate it into the syntax
of a PEG grammar or a CFG grammar as you like.
The first inset rule is a normal grammar production, and the @inset matches the
`same` match value the previous inset rule matched. The inset has been
specified in this case as one or more space or tab, and this requires that the
exact `same` inset is used on each line. The issues of tabs, and the tar-pit of
mixing them with spaces is legend, but I don't want to get side-tracked into
that issue, there is no good answer, and lots of pragmatic design for different
applications. It would be really nice if we could insist on only tab characters
for insets, but often that is seen as too rigid.
Another example of a `same` match is an XML element tag name match:
element = '<' tag '>' content '</' @tag '>'
tag = name
content = (element | text)*
The first tag is a normal match, but the second (end tag) is defined as @tag
meaning the `same` match that the first (start tag) matched.
Notice that the content of the element may contain nested elements, and the
@tag has to match the start tag of the same element (not just the last value
that any tag matched).
The @x operator is specified as a syntax tree reference, the `same` match is
defined as the last previous sibling tag node in the (partially generated)
It is quite easy to implement this in practice since it is just a relative
parse tree node value reference (used within the parser), albeit that the parse
tree is not yet fully constructed.
To fully define the @x operator we need to specify what happens if there is no
previous sibling node in the parse tree. This is defined to mean that the @x is
an empty match (i.e the empty string '' in the concrete grammar, or the e
symbol in the formal grammar).
This has the interesting and useful property that a nested paragraphs can now
be defined. A nested paragraph has to have a `greater` inset than the current
para = inset line (eol (@inset line | para))*
inset = @inset (space | tab)+
line = char+
The first time the inset rule is applied it will not find an @inset match for a
prior inset rule, and will therefore just match one or more space or tab. A
subsequent nested paragraph must have a greater inset, since its inset has to
match the current inset (the @inset) followed by one or more space or tab.
I have used this @x operator in playing around with grammar rules for many
years and have found it works quite well for a variety of nested inset
paragraph grammars that I have used (and other things).
It is ironic that the IETF RFC for the ABNF specification says that ABNF rules
may be continued on the next line if they have a greater inset than the first
line (of the rule). Of course the ABNF grammar for ABNF could not include this
specification! They could have simply insisted that ABNF rules did not start
with an inset, and then a continuation line could be defined as an eol followed
by one or more tab or space. But the definition of a greater inset is so much
better; its just a pity that it can't be specifeied in any CFG (or PEG), and
this is often an real grammar deficiency in practice.
The notation can be generalised a little further if desired, to allow the
`same` previous match to be applied to a child in a previous production, so
that @x.y matches the `same` value that the prior child y of x, matched:
element = start content end
start = '<' tag '>'
end = '</' @start.tag '>'
tag = name
content = (element | text)*
The @x.y notation searches back though the parse tree for the first previous x
node sibling, or a parents sibling. The match is the value of the child x of
this y node.
The @x.y notation is quite like an Xpath expression, but I have used @x.y
rather than @x/y to avoid confusion with the PEG choice operator (and trying to
use an Xpath backwards axis expression for this purpose is very ugly!). As an
aside, the method used by the @x operator can be included as a normal parse
tree access method that can also be used by an application after the parser has
generated the full parse tree.
I see that other people have used parameterised rules to define grammar rules
with a variable inset size (eg the YAML grammar). But I don't think this
appraoch can be implemented nearly as easily. I like my simple @x operator much
---- John Meacham <[EMAIL PROTECTED]> wrote:
> Hi, I am the author of the jhc haskell compiler as well as the frisby
> packrat parsing library for haskell.
> I would like to parse all of haskell via a packrat parser, generated by
> pappy, However, Haskell has a 'layout' rule which is causing some
> trouble. basically, in haskell, blocks are defined by layout.
> do foo
> let x = ...
> y = ...
> the 'do' and the 'let' both define a new layout context. I was wondering
> if anyone had any ideas on how to extend pappy in order to deal with
> the full specification of the layout rule is here in section 9.3:
> note that it cannot be carried out purely on the lexical level, as it
> depends on matching against a 'parse error' in one of its cases.
> adding a primitive that binds to the current column and consumes no
> input would be easy, but the "stack of layout contexts" seems to be a
> stumbling block.
> John Meacham - ⑆repetae.net⑆john⑈
> PEG mailing list
PEG mailing list