Grammar Rules For Inset Paragraphs

The Haskell rules look quite complex and there are lots of issues that I will 
ignore, but I hope the essence of the following approach may be helpful to you.

A grammar can specify a simple inset paragraph if it has the ability to specify 
that the inset of a line matches the `same` inset as the first line of the 
paragraph.

But this ability to specify the `same` match as a previous production is not 
possible with a CFG or a PEG. There have been various grammar language 
extension mechanisms and work-arounds used to cope with this problem.

The same problem appears in an even more obvious way in the XML grammar where 
the end tag name must match the `same` name as the start tag. The XML grammar 
simply leaves this out of the grammar rules, it is a restriction that is pushed 
out as a semantic check on the parse tree.

I have been using a grammar with an operator to extend a PEG or CFG with the 
ability to match the `same` value as a previous match. The prefix operator 
notation @x means that the rule must match the `same` value that the previous x 
rule matched (I'll define this more fully shortly). 

For example, a simple inset paragraph can be defined as:

        para = inset line (eol @inset line)*
        inset = (space | tab)+
        line = char+

I have used a generic BNF grammar rule sybtax here, I'll only use mutually 
exclusive choices in the examples here, so you can translate it into the syntax 
of a PEG grammar or a CFG grammar as you like.

The first inset rule is a normal grammar production, and the @inset matches the 
`same` match value the previous inset rule matched. The inset has been 
specified in this case as one or more space or tab, and this requires that the 
exact `same` inset is used on each line. The issues of tabs, and the tar-pit of 
mixing them with spaces is legend, but I don't want to get side-tracked into 
that issue, there is no good answer, and lots of pragmatic design for different 
applications. It would be really nice if we could insist on only tab characters 
for insets, but often that is seen as too rigid.

Another example of a `same` match is an XML element tag name match:

        element = '<' tag '>' content '</' @tag '>'
        tag = name
        content = (element | text)*

The first tag is a normal match, but the second (end tag) is defined as @tag 
meaning the `same` match that the first (start tag) matched.

Notice that the content of the element may contain nested elements, and the 
@tag has to match the start tag of the same element (not just the last value 
that any tag matched).

The @x operator is specified as a syntax tree reference, the `same` match is 
defined as the last previous sibling tag node in the (partially generated) 
parse tree.

It is quite easy to implement this in practice since it is just a relative 
parse tree node value reference (used within the parser), albeit that the parse 
tree is not yet fully constructed.

To fully define the @x operator we need to specify what happens if there is no 
previous sibling node in the parse tree. This is defined to mean that the @x is 
an empty match (i.e the empty string '' in the concrete grammar, or the e 
symbol in the formal grammar).

This has the interesting and useful property that a nested paragraphs can now 
be defined. A nested paragraph has to have a `greater` inset than the current 
paragraph:

        para = inset line (eol (@inset line | para))*
        inset = @inset (space | tab)+
        line = char+

The first time the inset rule is applied it will not find an @inset match for a 
prior inset rule, and will therefore just match one or more space or tab. A 
subsequent nested paragraph must have a greater inset, since its inset has to 
match the current inset (the @inset) followed by one or more space or tab. 

I have used this @x operator in playing around with grammar rules for many 
years and have found it works quite well for a variety of nested inset 
paragraph grammars that I have used (and other things).

It is ironic that the IETF RFC for the ABNF specification says that ABNF rules 
may be continued on the next line if they have a greater inset than the first 
line (of the rule). Of course the ABNF grammar for ABNF could not include this 
specification! They could have simply insisted that ABNF rules did not start 
with an inset, and then a continuation line could be defined as an eol followed 
by one or more tab or space. But the definition of a greater inset is so much 
better; its just a pity that it can't be specifeied in any CFG (or PEG), and 
this is often an real grammar deficiency in practice.

The notation can be generalised a little further if desired, to allow the 
`same` previous match to be applied to a child in a previous production, so 
that @x.y matches the `same` value that the prior child y of x, matched:

        element = start content end
        start = '<' tag '>'
        end = '</' @start.tag '>'
        tag = name
        content = (element | text)*

The @x.y notation searches back though the parse tree for the first previous x 
node sibling, or a parents sibling. The match is the value of the child x of 
this y node.

The @x.y notation is quite like an Xpath expression, but I have used @x.y 
rather than @x/y to avoid confusion with the PEG choice operator (and trying to 
use an Xpath backwards axis expression for this purpose is very ugly!). As an 
aside, the method used by the @x operator can be included as a normal parse 
tree access method that can also be used by an application after the parser has 
generated the full parse tree.

I see that other people have used parameterised rules to define grammar rules 
with a variable inset size (eg the YAML grammar). But I don't think this 
appraoch can be implemented nearly as easily. I like my simple @x operator much 
better :-)

Peter.


---- John Meacham <[EMAIL PROTECTED]> wrote: 
> Hi, I am the author of the jhc haskell compiler as well as the frisby
> packrat parsing library for haskell.
> 
> I would like to parse all of haskell via a packrat parser, generated by
> pappy, However, Haskell has a 'layout' rule which is causing some
> trouble. basically, in haskell, blocks are defined by layout.
> 
> do foo
>    bar
>    let x = ...
>        y = ...
>    baz
> 
> the 'do' and the 'let' both define a new layout context. I was wondering
> if anyone had any ideas on how to extend pappy in order to deal with
> this? 
> 
> the full specification of the layout rule is here in section 9.3:
> 
> http://haskell.org/onlinereport/syntax-iso.html                               
>                          
> 
> note that it cannot be carried out purely on the lexical level, as it
> depends on matching against a 'parse error' in one of its cases.
> 
> adding a primitive that binds to the current column and consumes no
> input would be easy, but the "stack of layout contexts" seems to be a
> stumbling block.
> 
>         John
> 
> 
> -- 
> John Meacham - ⑆repetae.net⑆john⑈
> 
> _______________________________________________
> PEG mailing list
> PEG@lists.csail.mit.edu
> https://lists.csail.mit.edu/mailman/listinfo/peg


_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to